ETLMR: A Highly Scalable Dimensional ETL Framework Based on MapReduce

Liu, Xiufeng; Thomsen, Christian; Pedersen, Torben Bach

doi:10.1007/978-3-642-37574-3_1

Chapter

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 7790)

Abstract

Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is to process huge volumes of data quickly. MapReduce is establishing itself as the de facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL-specific constructs, resulting in low ETL programmer productivity. This paper presents a scalable dimensional ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few lines of code. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The paper describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.

Author information

Authors and Affiliations

Department of Computer Science, Aalborg University, Denmark
Xiufeng Liu, Christian Thomsen & Torben Bach Pedersen

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, 118 route de Narbonne, 31062, Toulouse Cedex, France
Abdelkader Hameurlain
Institute for Application Oriented Knowledge Processing, 4020, Linz, Austria
Josef Küng
FAW, University of Linz, Altenbergerstraße 69, 4040, Linz, Austria
Roland Wagner
ICAR-CNR, University of Calabria, via P. Bucci 41C, 87036, Rende (CS), Italy
Alfredo Cuzzocrea
Hewlett-Packard Laboratories, 1501 Page Mill Road, 94304, Palo Alto, CA, USA
Umeshwar Dayal

About this chapter

Cite this chapter

Liu, X., Thomsen, C., Pedersen, T.B. (2013). ETLMR: A Highly Scalable Dimensional ETL Framework Based on MapReduce. In: Hameurlain, A., Küng, J., Wagner, R., Cuzzocrea, A., Dayal, U. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII. Lecture Notes in Computer Science, vol 7790. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37574-3_1

DOI: https://doi.org/10.1007/978-3-642-37574-3_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37573-6
Online ISBN: 978-3-642-37574-3
eBook Packages: Computer Science, Computer Science (R0)
