Encyclopedia of Big Data Technologies

2019 Edition | Editors: Sherif Sakr, Albert Y. Zomaya


  • Christian Thomsen
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-77525-8_11



ETL is short for Extract-Transform-Load. The ETL process extracts data from operational source systems, transforms the data, and loads the data into a target. The transformations to perform on the data can involve a plethora of different activities, e.g., filtering, normalization or de-normalization to a desired form, joins, conversion, and cleansing to remove bad or dirty data. In the ELT variant, the data is extracted from the source systems, loaded in its raw form into the target, and then transformed.
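
The difference between the two orderings can be illustrated with a minimal sketch. It is not taken from the entry: the sample rows, the table names (`sales`, `raw_sales`), and the function names are hypothetical, and an in-memory SQLite database merely stands in for the target. In the ETL path the cleansing and conversion happen before loading; in the ELT path the raw rows are loaded first and then transformed inside the target with SQL.

```python
import sqlite3

# Hypothetical rows from an operational source: (customer, amount as text, country).
SOURCE_ROWS = [
    ("alice", "10.50", "dk"),
    ("bob", "n/a", "DK"),    # dirty amount that cleansing should remove
    ("carol", "7.25", "se"),
]


def extract():
    """Extract: read the raw rows from the (here in-memory) source."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: drop unparsable amounts, convert types, normalize country codes."""
    clean = []
    for customer, amount, country in rows:
        try:
            value = float(amount)
        except ValueError:
            continue  # cleansing: skip rows whose amount is not a number
        clean.append((customer, value, country.upper()))
    return clean


def load(conn, rows):
    """Load: write the already transformed rows into the target table."""
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def read_sales(conn):
    return conn.execute(
        "SELECT customer, amount, country FROM sales ORDER BY customer"
    ).fetchall()


def run_etl():
    """ETL: extract, transform outside the target, then load."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract()))
    return read_sales(conn)


def run_elt():
    """ELT: extract, load the raw data, then transform inside the target."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT, country TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", extract())
    # The transformation is expressed in SQL and executed by the target itself.
    conn.execute(
        """
        CREATE TABLE sales AS
        SELECT customer,
               CAST(amount AS REAL) AS amount,
               UPPER(country) AS country
        FROM raw_sales
        WHERE amount GLOB '[0-9]*.[0-9]*'
        """
    )
    return read_sales(conn)


if __name__ == "__main__":
    print(run_etl())
    print(run_elt())
```

Both functions end with the same cleansed `sales` table; what differs is where the transformation work runs, which is the point of the ELT variant.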


The term ETL process has traditionally been used for a process that populates a data warehouse (DW) managed by a relational database management system (RDBMS). As pointed out by Simitsis and Vassiliadis (2017), the basic concept of populating a data store with data reshaped from another data store is, however, older than data warehousing. The ETL process can be hand-coded or made with a designated ETL tool where the...



  1. Akidau T, Bradshaw R, Chambers C, Chernyak S, Fernández-Moctezuma RJ, Lax R, McVeety S, Mills D, Perry F, Schmidt E et al (2015) The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc VLDB Endow 8(12):1792–1803
  2. Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, Meng X, Kaftan T, Franklin MJ, Ghodsi A et al (2015) Spark SQL: relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, pp 1383–1394
  3. Dageville B, Cruanes T, Zukowski M, Antonov V, Avanes A, Bock J, Claybaugh J, Engovatov D, Hentschel M, Huang J, Lee AW, Motivala A, Munir AQ, Pelley S, Povinec P, Rahn G, Triantafyllis S, Unterbrunner P (2016) The snowflake elastic data warehouse. In: Proceedings of the 2016 international conference on management of data, SIGMOD’16, New York. ACM, pp 215–226. ISBN:978-1-4503-3531-7. http://doi.acm.org/10.1145/2882903.2903741
  4. Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters. In: 6th symposium on operating system design and implementation (OSDI 2004), San Francisco, 6–8 Dec 2004, pp 137–150. http://www.usenix.org/events/osdi04/tech/dean.html
  5. Gupta A, Agarwal D, Tan D, Kulesza J, Pathak R, Stefani S, Srinivasan V (2015) Amazon redshift and the case for simpler data warehouses. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, SIGMOD’15, New York. ACM, pp 1917–1923. ISBN:978-1-4503-2758-9. http://doi.acm.org/10.1145/2723372.2742795
  6. Kandel S, Paepcke A, Hellerstein J, Heer J (2011) Wrangler: interactive visual specification of data transformation scripts. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 3363–3372
  7. Khayyat Z, Ilyas IF, Jindal A, Madden S, Ouzzani M, Papotti P, Quiané-Ruiz J-A, Tang N, Yin S (2015) Bigdansing: a system for big data cleansing. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, pp 1215–1230
  8. Kimball R (2008) The data warehouse lifecycle toolkit. Wiley, Hoboken
  9. Liu X, Thomsen C, Pedersen TB (2013) Etlmr: a highly scalable dimensional ETL framework based on mapreduce. In: Hameurlain A, Küng J, Wagner RR (eds) Transactions on large-scale data-and knowledge-centered systems VIII. Springer, Heidelberg/New York, pp 1–31
  10. Liu X, Thomsen C, Pedersen TB (2014) Cloudetl: scalable dimensional ETL for hive. In: Proceedings of the 18th international database engineering & applications symposium. ACM, pp 195–206
  11. Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data. ACM, pp 1099–1110
  12. Özcan F, Hoa D, Beyer KS, Balmin A, Liu CJ, Li Y (2011) Emerging trends in the enterprise data analytics: connecting hadoop and db2 warehouse. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data. ACM, pp 1161–1164
  13. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: Parallel analysis with Sawzall. Sci Program 13(4):277–298
  14. Simitsis A, Vassiliadis P (2017) Extraction, transformation, and loading. Springer, New York, pp 1–9. ISBN 978-1-4899-7993-3. https://doi.org/10.1007/978-1-4899-7993-3_158-3
  15. Stonebraker M, Bruckner D, Ilyas IF, Beskales G, Cherniack M, Zdonik SB, Pagan A, Xu S (2013) Data curation at scale: the data tamer system. In: CIDR
  16. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endow 2(2):1626–1629
  17. Tigani J, Naidu S (2014) Google BigQuery analytics. Wiley, Indianapolis
  18. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science, Aalborg University, Aalborg, Denmark