Definition
A data lake is a data repository in which datasets from multiple sources are stored in their original structures. It should provide functions to extract data and metadata from heterogeneous sources and to ingest them into a hybrid storage system. In addition, a data lake should offer a data transformation engine, in which datasets can be transformed, cleaned, and integrated with other datasets. Finally, interfaces to explore and to query the data and metadata of a data lake should be also available in a data lake system.
Overview
The term “data lake” (DL) was first mentioned by James Dixon in 2010 in a blog post (https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/) where he put data marts on the same level as bottled water, which is cleansed, packaged, and structured for easy consumption. In contrast, a data lake manages the raw data as it is ingested from the data sources.
In the initial article (and in a later, more detailed article (https://jamesdixon.wordpress.com/2014/09/25/data-lakes-revisited/...
This is a preview of subscription content, log in via an institution.
References
Abedjan Z, Golab L, Naumann F (2015) Profiling relational data: a survey. VLDB J 24(4):557–581. https://doi.org/10.1007/s00778-015-0389-y
Alserafi A, Calders T, Abelló A, Romero O (2017) Ds-prox: dataset proximity mining for governing the data lake. In: Beecks C, Borutta F, Kröger P, Seidl T (eds) Proceedings of 10th international conference similarity search and applications, SISAP 2017, Munich, 4–6 Oct 2017. Lecture notes in computer science, vol 10609, pp 284–299. Springer. https://doi.org/10.1007/978-3-319-68474-1_20
Bernstein PA, Melnik S (2007) Model management 2.0: manipulating richer mappings. In: Zhou L, Ling TW, Ooi BC (eds) Proceedings of ACM SIGMOD international conference on management of data. ACM Press, Beijing, pp 1–12. https://doi.org/10.1145/1247480.1247482
Boci E, Thistlethwaite S (2015) A novel big data architecture in support of ads-b data analytic. In: Proceedings of integrated communication, navigation, and surveillance conference (ICNS), pp C1-1–C1-8. https://doi.org/10.1109/ICNSURV.2015.7121218
Calvanese D, De Giacomo G, Lenzerini M, Vardi MY (2012) Query processing under glav mappings for relational and graph databases. Proc VLDB Endow 6(2):61–72
Curino C, Moon HJ, Deutsch A, Zaniolo C (2013) Automating the database schema evolution process. VLDB J 22(1):73–98
Douglas C, Curino C (2015) Blind men and an elephant coalescing open-source, academic, and industrial perspectives on bigdata. In: Gehrke J, Lehner W, Shim K, Cha SK, Lohman GM (eds) 31st IEEE international conference on data engineering, ICDE 2015, Seoul, 13–17 Apr 2015. IEEE Computer Society, pp 1523–1526. https://doi.org/10.1109/ICDE.2015.7113417. http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=7109453
Florescu D, Fourny G (2013) Jsoniq: the history of a query language. IEEE Internet Comput 17(5):86–90
Franklin M, Halevy A, Maier D (2005) From databases to dataspaces: a new abstraction for information management. SIGMOD Rec 34(4):27–33. https://doi.org/10.1145/1107499.1107502
Gottlob G, Orsi G, Pieris A (2014) Query rewriting and optimization for ontological databases. ACM Trans Database Syst 39(3):25:1–25:46. https://doi.org/10.1145/2638546
Halevy AY, Korn F, Noy NF, Olston C, Polyzotis N, Roy S, Whang SE (2016) Managing Google’s data lake: an overview of the goods system. IEEE Data Eng Bull 39(3):5–14. http://sites.computer.org/debull/A16sept/p5.pdf
Hartung M, Terwilliger JF, Rahm E (2011) Recent advances in schema and ontology evolution. In: Bellahsene Z, Bonifati A, Rahm E (eds) Schema matching and mapping, data-centric systems and applications. Springer, pp 149–190. https://doi.org/10.1007/978-3-642-16518-4
Jarke M, Quix C (2017) On warehouses, lakes, and spaces: the changing role of conceptual modeling for data integration. In: Cabot J, Gómez C, Pastor O, Sancho M, Teniente E (eds) Conceptual modeling perspectives. Springer, pp 231–245. https://doi.org/10.1007/978-3-319-67271-7_16
Jarke M, Jeusfeld MA, Quix C, Vassiliadis P (1999) Architecture and quality in data warehouses: an extended repository approach. Inf Syst 24(3):229–253
Jeffery SR, Franklin MJ, Halevy AY (2008) Pay-as-you-go user feedback for dataspace systems. In: Wang JTL (ed) Proceedings of ACM SIGMOD international conference on management of data. ACM Press, Vancouver, pp 847–860. https://doi.org/10.1145/1376616.1376701
Karæz Y, Ivanova M, Zhang Y, Manegold S, Kersten ML (2013) Lazy ETL in action: ETL technology dates scientific data. PVLDB 6(12):1286–1289. http://www.vldb.org/pvldb/vol6/p1286-kargin.pdf
Kensche D, Quix C, Li X, Li Y, Jarke M (2009) Generic schema mappings for composition and query answering. Data Knowl Eng 68(7):599–621. https://doi.org/10.1016/j.datak.2009.02.006
LaPlante A, Sharma B (2016) Architecting data lakes. O’Reilly Media, Sebastopol, CA, USA
Mathis C (2017) Data lakes. Datenbank-Spektrum 17(3):289–293. https://doi.org/10.1007/s13222-017-0272-7
Otto B (2011) Data governance. Bus Inf Syst Eng 3(4):241–244. https://doi.org/10.1007/s12599-011-0162-8
Quix C, Berlage T, Jarke M (2016) Interactive pay-as-you-go-integration of life science data: the HUMIT approach. ERCIM News 2016(104). http://ercim- news.ercim.eu/en104/special/interactive-pay-as-you- go-integration-of-life-science-data-the-humit-approach
Saha B, Srivastava D (2014) Data quality: the other face of big data. In: Cruz IF, Ferrari E, Tao Y, Bertino E, Trajcevski G (eds) Proceedings of 30th international conference on data engineering (ICDE). IEEE, Chicago, pp 1294–1297. https://doi.org/10.1109/ICDE.2014.6816764
Sarma AD, Dong X, Halevy AY (2008) Bootstrapping pay-as-you-go data integration systems. In: Wang JTL (ed) Proceedings of ACM SIGMOD international conference on management of data. ACM Press, Vancouver, pp 861–874
Stein B, Morrison A (2014) The enterprise data lake: better integration and deeper analytics. http:// www.pwc.com/us/en/technology-forecast/2014/cloud- computing/assets/pdf/pwc-technology-forecast-data- lakes.pdf
Terrizzano I, Schwarz PM, Roth M, Colino JE (2015) Data wrangling: the challenging journey from the wild to the lake. In: 7th Biennial conference on innovative data systems (CIDR). http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper2.pdf
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Section Editor information
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this entry
Cite this entry
Quix, C., Hai, R. (2018). Data Lake. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_7-1
Download citation
DOI: https://doi.org/10.1007/978-3-319-63962-8_7-1
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63962-8
Online ISBN: 978-3-319-63962-8
eBook Packages: Springer Reference MathematicsReference Module Computer Science and Engineering