Programmatic ETL

  • Christian Thomsen
  • Ove Andersen
  • Søren Kejser Jensen
  • Torben Bach Pedersen
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 324)

Abstract

Extract-Transform-Load (ETL) processes extract data, transform it, and load it into data warehouses (DWs). The dominant ETL tools use graphical user interfaces (GUIs) in which the developer “draws” the ETL flow by connecting steps/transformations with lines. This gives an easy overview, but it can also be tedious and require much trivial work for simple things. We therefore challenge this approach and propose doing ETL programming by writing code. To make the programming easy, we present the Python-based framework pygrametl, which offers commonly used functionality for ETL development. By using the framework, the developer can efficiently create effective ETL solutions that exploit the full power of programming. In this chapter, we present our work on pygrametl and related activities. Further, we consider some of the lessons learned during the development of pygrametl as an open source framework.
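
To illustrate what expressing an ETL flow as code rather than as a drawn diagram can look like, the following is a minimal sketch that uses only the Python standard library. It does not use pygrametl and does not reproduce its API. The table names, column names, sample rows, and the helper `ensure_dimension` are hypothetical and chosen only for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real flow would read from files or databases.
SOURCE_CSV = """book,genre,store,sales
Nineteen Eighty-Four,Novel,Aalborg,3
Calvin and Hobbes,Comic,Aalborg,5
Nineteen Eighty-Four,Novel,Odense,2
"""


def create_schema(conn):
    """Create a tiny star schema: one dimension table and one fact table."""
    conn.executescript(
        """
        CREATE TABLE book_dim (
            bookid INTEGER PRIMARY KEY,
            book   TEXT UNIQUE,
            genre  TEXT
        );
        CREATE TABLE sales_fact (
            bookid INTEGER,
            store  TEXT,
            sales  INTEGER
        );
        """
    )


def ensure_dimension(conn, row):
    """Return the surrogate key for the row's book, inserting it if missing."""
    found = conn.execute(
        "SELECT bookid FROM book_dim WHERE book = ?", (row["book"],)
    ).fetchone()
    if found:
        return found[0]
    cur = conn.execute(
        "INSERT INTO book_dim (book, genre) VALUES (?, ?)",
        (row["book"], row["genre"]),
    )
    return cur.lastrowid


def run_etl(conn, source_text):
    """Extract rows from CSV text, transform them, and load them into the DW."""
    for row in csv.DictReader(io.StringIO(source_text)):  # extract
        row["sales"] = int(row["sales"])                   # transform
        bookid = ensure_dimension(conn, row)               # load dimension
        conn.execute(                                      # load fact
            "INSERT INTO sales_fact VALUES (?, ?, ?)",
            (bookid, row["store"], row["sales"]),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
create_schema(conn)
run_etl(conn, SOURCE_CSV)
```

In this sketch the whole flow is ordinary code, so loops, functions, and conditionals are available for the transformation step. A framework such as the one described in the abstract would presumably take over the repetitive parts, for example the lookup-or-insert logic written by hand here.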

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Christian Thomsen (1)
  • Ove Andersen (1)
  • Søren Kejser Jensen (1)
  • Torben Bach Pedersen (1)
  1. Department of Computer Science, Aalborg University, Aalborg, Denmark
