A Cloud-Native Serverless Approach for Implementation of Batch Extract-Load Processes in Data Lakes

Bryzgalov, Anton; Stupnikov, Sergey

doi:10.1007/978-3-030-81200-3_3

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1427)

Included in the following conference series:

International Conference on Data Analytics and Management in Data Intensive Domains

Abstract

The paper presents an approach to batch extract-load processes for cloud data lakes. The approach combines multiple data ingestion techniques, provides advanced failover strategies, and adopts a cloud-native implementation. The prototype implementation uses the Amazon Web Services platform and is based on its serverless features. The approach can also be implemented on other cloud platforms such as Google Cloud Platform or Microsoft Azure.

Acknowledgments

The research is financially supported by the Russian Foundation for Basic Research, projects 18-07-01434 and 18-29-22096. The research was carried out using the infrastructure of the shared research facilities CKP «Informatics» (http://www.frccsc.ru/ckp) of FRC CSC RAS.

Author information

Authors and Affiliations

Lomonosov Moscow State University, Moscow, Russia
Anton Bryzgalov
Institute of Informatics Problems, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, Moscow, Russia
Sergey Stupnikov

Corresponding author

Correspondence to Sergey Stupnikov.

Editor information

Editors and Affiliations

Voronezh State University, Voronezh, Russia
Alexander Sychev
Voronezh State University, Voronezh, Russia
Sergey Makhortov
Christian-Albrecht University of Kiel, Kiel, Schleswig-Holstein, Germany
Bernhard Thalheim

About this paper

Cite this paper

Bryzgalov, A., Stupnikov, S. (2021). A Cloud-Native Serverless Approach for Implementation of Batch Extract-Load Processes in Data Lakes. In: Sychev, A., Makhortov, S., Thalheim, B. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2020. Communications in Computer and Information Science, vol 1427. Springer, Cham. https://doi.org/10.1007/978-3-030-81200-3_3

DOI: https://doi.org/10.1007/978-3-030-81200-3_3
Published: 16 July 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-81199-0
Online ISBN: 978-3-030-81200-3
eBook Packages: Computer Science, Computer Science (R0)
