ETLMR: A Highly Scalable Dimensional ETL Framework Based on MapReduce

Part of the book series: Lecture Notes in Computer Science ((TLDKS,volume 7790))

Abstract

Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is to process huge volumes of data quickly. MapReduce is establishing itself as the de facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL-specific constructs, resulting in low ETL programmer productivity. This paper presents a scalable dimensional ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few lines of code. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The paper describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.
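
To make the idea of processing a slowly changing dimension in a MapReduce style more concrete, the following is a minimal sketch in plain Python. It is not ETLMR's API and not the processing scheme from the chapter. All function names (`map_rows`, `shuffle`, `reduce_scd2`, `load_dimension`) and attribute names (`url`, `size`, `date`, `validfrom`, `validto`) are assumptions chosen for illustration. The sketch runs in a single process and draws surrogate keys from one counter, whereas a distributed setting would need a coordinated key generator.

```python
from collections import defaultdict
from itertools import count


def map_rows(rows, key_attr):
    """Map phase: emit (business key, row) pairs."""
    for row in rows:
        yield row[key_attr], row


def shuffle(pairs):
    """Group mapped rows by business key, as a MapReduce runtime would."""
    groups = defaultdict(list)
    for key, row in pairs:
        groups[key].append(row)
    return groups


def reduce_scd2(rows, surrogate_keys, date_attr, tracked_attr):
    """Reduce phase: build type-2 SCD versions for one business key."""
    versions = []
    # ISO-formatted date strings sort chronologically.
    for row in sorted(rows, key=lambda r: r[date_attr]):
        if versions and versions[-1][tracked_attr] == row[tracked_attr]:
            continue  # tracked attribute unchanged: no new version
        if versions:
            versions[-1]["validto"] = row[date_attr]  # close previous version
        versions.append({
            **row,
            "id": next(surrogate_keys),
            "version": len(versions) + 1,
            "validfrom": row[date_attr],
            "validto": None,  # None marks the current version
        })
    return versions


def load_dimension(rows, key_attr="url", date_attr="date", tracked_attr="size"):
    """Run map, shuffle and reduce sequentially and return dimension rows."""
    surrogate_keys = count(1)  # hypothetical single-process key generator
    groups = shuffle(map_rows(rows, key_attr))
    dimension = []
    for key in sorted(groups):  # sorted only to make the output deterministic
        dimension.extend(
            reduce_scd2(groups[key], surrogate_keys, date_attr, tracked_attr)
        )
    return dimension


if __name__ == "__main__":
    source = [
        {"url": "a.com", "size": 10, "date": "2012-01-01"},
        {"url": "b.com", "size": 5, "date": "2012-01-01"},
        {"url": "a.com", "size": 12, "date": "2012-02-01"},
    ]
    for member in load_dimension(source):
        print(member["id"], member["url"], member["version"],
              member["validfrom"], member["validto"])
```

Grouping by business key means that all versions of one dimension member are handled by the same reduce call, so version numbers and validity intervals can be assigned without coordination between keys. Only the surrogate key counter is shared state in this sketch.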

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Liu, X., Thomsen, C., Pedersen, T.B. (2013). ETLMR: A Highly Scalable Dimensional ETL Framework Based on MapReduce. In: Hameurlain, A., Küng, J., Wagner, R., Cuzzocrea, A., Dayal, U. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII. Lecture Notes in Computer Science, vol 7790. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37574-3_1

  • DOI: https://doi.org/10.1007/978-3-642-37574-3_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37573-6

  • Online ISBN: 978-3-642-37574-3

  • eBook Packages: Computer Science, Computer Science (R0)
