A data lake is a data repository in which datasets from multiple sources are stored in their original structures. It should provide functions to extract data and metadata from heterogeneous sources and to ingest them into a hybrid storage system. In addition, a data lake should offer a data transformation engine in which datasets can be transformed, cleaned, and integrated with other datasets. Finally, a data lake system should also provide interfaces to explore and query its data and metadata.
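The four functions named in this definition (ingestion with metadata extraction, storage of raw data, transformation, and exploration and querying) can be illustrated with a minimal in-memory sketch. It is an illustration under stated assumptions, not a reference architecture: the names `MiniDataLake`, `Entry`, and `extract_metadata` are hypothetical, a Python dictionary stands in for the hybrid storage system, and only CSV and JSON sources are handled.

```python
from __future__ import annotations

import csv
import io
import json
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Entry:
    """One raw dataset plus the metadata extracted at ingestion time."""
    raw: str                  # payload kept exactly as delivered by the source
    metadata: dict[str, Any]  # format, attribute names, source


def extract_metadata(raw: str, fmt: str) -> dict[str, Any]:
    """Derive simple structural metadata without altering the raw payload."""
    if fmt == "csv":
        header = next(csv.reader(io.StringIO(raw)), [])
        return {"format": fmt, "attributes": header}
    if fmt == "json":  # assumption: a JSON array of flat objects
        keys = sorted({key for record in json.loads(raw) for key in record})
        return {"format": fmt, "attributes": keys}
    return {"format": fmt, "attributes": []}


@dataclass
class MiniDataLake:
    """Hypothetical toy lake; a dict stands in for the hybrid storage system."""
    entries: dict[str, Entry] = field(default_factory=dict)

    def ingest(self, name: str, raw: str, fmt: str, source: str) -> None:
        """Store the dataset in its original structure and record metadata."""
        metadata = extract_metadata(raw, fmt)
        metadata["source"] = source
        self.entries[name] = Entry(raw, metadata)

    def find_by_attribute(self, attribute: str) -> list[str]:
        """Metadata exploration: which datasets expose a given attribute?"""
        return sorted(
            name for name, entry in self.entries.items()
            if attribute in entry.metadata["attributes"]
        )

    def read(self, name: str) -> list[dict[str, Any]]:
        """Data query: parse the raw payload into records on demand."""
        entry = self.entries[name]
        if entry.metadata["format"] == "csv":
            return [dict(row) for row in csv.DictReader(io.StringIO(entry.raw))]
        if entry.metadata["format"] == "json":
            return json.loads(entry.raw)
        raise ValueError(f"no reader for format {entry.metadata['format']!r}")

    def transform(self, name: str, new_name: str,
                  fn: Callable[[dict[str, Any]], dict[str, Any]]) -> None:
        """Transformation: derive a new dataset; the raw input stays untouched."""
        rows = [fn(row) for row in self.read(name)]
        self.ingest(new_name, json.dumps(rows), "json", source=f"derived:{name}")


lake = MiniDataLake()
lake.ingest("patients", "id,name\n1,Ada\n2,Bob\n", "csv", source="hospital_db")
lake.ingest("visits", '[{"id": "1", "ward": "A"}]', "json", source="visit_api")
lake.transform("patients", "patients_clean",
               lambda row: {**row, "name": row["name"].upper()})
print(lake.find_by_attribute("id"))
```

Note that `transform` writes its result as a new dataset and records its provenance in the metadata, so the originally ingested payload remains available unchanged.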
The term “data lake” (DL) was first mentioned by James Dixon in a 2010 blog post (https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/), in which he likened data marts to bottled water: cleansed, packaged, and structured for easy consumption. A data lake, in contrast, manages the raw data as it is ingested from the data sources.
In the initial article (and in a later, more detailed article (https://jamesdixon.wordpress.com/2014/09/25/data-lakes-revisited/...