Data cleaning refers to the tasks and activities that detect and repair errors in data.
Enterprises have been acquiring large amounts of data from a variety of sources to build their own “data lakes,” with the goal of enriching their data assets and enabling richer and more informed analytics. Data collection and acquisition often introduce errors into the data, such as missing values, typos, mixed formats, replicated entries for the same real-world entity, outliers, and violations of business rules.
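To make some of these error types concrete, the following is a minimal Python sketch that flags missing values, mixed formats, replicated entries, and a business-rule violation in a toy table. The records, field names, the ISO date convention, and the age rule are all hypothetical and chosen only for illustration. The sketch covers detection only, not repair, and it is not a method taken from the works cited here.

```python
import re

# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"id": 1, "name": "Alice Smith", "joined": "2017-03-01", "age": 34},
    {"id": 2, "name": "Bob Jones", "joined": "03/05/2017", "age": 41},     # mixed date format
    {"id": 3, "name": "alice smith ", "joined": "2017-03-01", "age": 34},  # replicated entity
    {"id": 4, "name": "Carol Wu", "joined": "2017-06-12", "age": None},    # missing value
    {"id": 5, "name": "Dan Roe", "joined": "2017-07-30", "age": -5},       # rule violation
]

# Assumed convention: dates should be written as YYYY-MM-DD.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def missing_values(rows):
    """Return (id, field) pairs whose value is None or an empty string."""
    return [(r["id"], k) for r in rows for k, v in r.items() if v is None or v == ""]


def mixed_formats(rows, field="joined", pattern=ISO_DATE):
    """Return ids whose field value does not match the expected pattern."""
    return [r["id"] for r in rows
            if r[field] is not None and not pattern.fullmatch(r[field])]


def likely_duplicates(rows, field="name"):
    """Group ids whose field values agree after case and whitespace normalization."""
    groups = {}
    for r in rows:
        key = " ".join(r[field].lower().split())
        groups.setdefault(key, []).append(r["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def rule_violations(rows):
    """Assumed business rule: age, when present, must lie in [0, 120]."""
    return [r["id"] for r in rows
            if r["age"] is not None and not 0 <= r["age"] <= 120]


report = {
    "missing": missing_values(records),
    "mixed_format": mixed_formats(records),
    "duplicates": likely_duplicates(records),
    "rule_violations": rule_violations(records),
}
print(report)
```

Exact matching after normalization catches only the simplest replicated entries; typos and outliers generally need approximate string matching and statistical methods, which this sketch does not attempt.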
Kaggle’s 2017 survey on the state of data science and machine learning reveals that dirty data is the most common barrier faced by workers dealing with data (Kaggle 2017). Not surprisingly, developing effective and efficient data cleaning solutions is challenging and rich with deep theoretical and engineering problems.
There are various surveys and books on different aspects of data quality and data cleaning....
- Abedjan Z, Morcos J, Ilyas IF, Ouzzani M, Papotti P, Stonebraker M (2016) Dataxformer: a robust transformation discovery system. In: Proceedings of 32nd international conference on data engineering, pp 1134–1145
- Arasu A, Götz M, Kaushik R (2010) On active learning of record matching packages. In: Proceedings of ACM SIGMOD international conference on management of data, pp 783–794
- Chawla S, Sun P (2006) Outlier detection: principles, techniques and applications. In: Advances in knowledge discovery and data mining, 10th Pacific-Asia conference
- Chu X, Ilyas IF, Papotti P (2013) Holistic data cleaning: putting violations into context. In: Proceedings of 29th international conference on data engineering, pp 458–469
- Gulwani S (2011) Automating string processing in spreadsheets using input-output examples. In: Proceedings of 38th ACM SIGACT-SIGPLAN symposium on principles of programming languages, pp 317–330
- Hellerstein JM (2008) Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE)
- Kaggle (2017) https://goo.gl/ZAZGsD
- Kandel S, Paepcke A, Hellerstein J, Heer J (2011) Wrangler: interactive visual specification of data transformation scripts. In: Proceedings of SIGCHI conference on human factors in computing systems, pp 3363–3372
- Knorr EM, Ng RT (1998) Algorithms for mining distance-based outliers in large datasets. In: Proceedings of 24th international conference on very large data bases, pp 392–403
- Rahm E, Do HH (2000) Data cleaning: problems and current approaches. IEEE Data Eng Bull 23:2000
- Raman V, Hellerstein JM (2001) Potter’s wheel: an interactive data cleaning system. In: Proceedings of 27th international conference on very large data bases, pp 381–390
- Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of 8th ACM SIGKDD international conference on knowledge discovery and data mining, pp 269–278