Abstract
The paper presents an approach to handling batch extract-load processes for cloud data lakes. The approach combines multiple data ingestion techniques, provides advanced failover strategies, and adopts a cloud-native implementation. The prototype implementation uses the Amazon Web Services platform and is based on its serverless features. The approach can also be implemented on other cloud platforms such as Google Cloud Platform or Microsoft Azure.
Acknowledgments
The research was financially supported by the Russian Foundation for Basic Research, projects 18-07-01434 and 18-29-22096. The research was carried out using the infrastructure of the shared research facilities CKP «Informatics» (http://www.frccsc.ru/ckp) of FRC CSC RAS.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Bryzgalov, A., Stupnikov, S. (2021). A Cloud-Native Serverless Approach for Implementation of Batch Extract-Load Processes in Data Lakes. In: Sychev, A., Makhortov, S., Thalheim, B. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2020. Communications in Computer and Information Science, vol 1427. Springer, Cham. https://doi.org/10.1007/978-3-030-81200-3_3
DOI: https://doi.org/10.1007/978-3-030-81200-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-81199-0
Online ISBN: 978-3-030-81200-3
eBook Packages: Computer Science, Computer Science (R0)