A Cloud-Native Serverless Approach for Implementation of Batch Extract-Load Processes in Data Lakes

  • Conference paper
Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2020)

Abstract

The paper presents an approach to batch extract-load processes for cloud data lakes. The approach combines multiple data ingestion techniques, provides advanced failover strategies, and adopts a cloud-native implementation. The prototype implementation of the suggested approach uses the Amazon Web Services platform and is based on its serverless features. The approach can also be implemented on other cloud platforms such as Google Cloud Platform or Microsoft Azure.



Acknowledgments

The research is financially supported by the Russian Foundation for Basic Research, projects 18-07-01434, 18-29-22096. The research was carried out using infrastructure of shared research facilities CKP «Informatics» (http://www.frccsc.ru/ckp) of FRC CSC RAS.

Author information

Corresponding author

Correspondence to Sergey Stupnikov.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Bryzgalov, A., Stupnikov, S. (2021). A Cloud-Native Serverless Approach for Implementation of Batch Extract-Load Processes in Data Lakes. In: Sychev, A., Makhortov, S., Thalheim, B. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2020. Communications in Computer and Information Science, vol 1427. Springer, Cham. https://doi.org/10.1007/978-3-030-81200-3_3

  • DOI: https://doi.org/10.1007/978-3-030-81200-3_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-81199-0

  • Online ISBN: 978-3-030-81200-3

  • eBook Packages: Computer Science, Computer Science (R0)
