Encyclopedia of Big Data Technologies

2019 Edition
Editors: Sherif Sakr, Albert Y. Zomaya

Data Cleaning

Reference work entry
DOI: https://doi.org/10.1007/978-3-319-77525-8_3



Data cleaning refers to all tasks and activities that detect and repair errors in data.


Enterprises have been acquiring large amounts of data from a variety of sources to build their own “data lakes,” with the goal of enriching their data assets and enabling richer and more informed analytics. Data collection and acquisition often introduce errors into the data, for example, missing values, typos, mixed formats, replicated entries for the same real-world entity, outliers, and violations of business rules.
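As a rough illustration of several of the error types listed above, the following Python sketch flags missing values, mixed date formats, likely replicated entries, numeric outliers, and a business-rule violation in a small table. Everything in it is an assumption made for illustration: the records, the field names, the ISO date convention, the minimum-age rule, and the choice of a median-absolute-deviation test with a cutoff of 3.5 are not taken from this entry, and real cleaning systems use far more sophisticated techniques.

```python
import re
import statistics

# Hypothetical toy records; field names and values are invented for illustration.
records = [
    {"id": 1, "name": "Alice Smith", "joined": "2017-03-01", "age": 34, "salary": 52000},
    {"id": 2, "name": "Bob Jones", "joined": "03/05/2017", "age": 41, "salary": 58000},
    {"id": 3, "name": "alice smith ", "joined": "2017-03-01", "age": 34, "salary": 52000},
    {"id": 4, "name": "Carol White", "joined": "2017-06-11", "age": None, "salary": 61000},
    {"id": 5, "name": "Dan Brown", "joined": "2017-07-19", "age": 16, "salary": 55000},
    {"id": 6, "name": "Eve Black", "joined": "2017-08-02", "age": 29, "salary": 990000},
]

# Assumed convention: dates should be written as YYYY-MM-DD.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def find_missing(rows):
    """Return ids of rows that contain at least one missing (None) value."""
    return [r["id"] for r in rows if any(v is None for v in r.values())]


def find_mixed_formats(rows):
    """Return ids of rows whose date does not follow the assumed ISO format."""
    return [r["id"] for r in rows if not ISO_DATE.fullmatch(r["joined"])]


def find_duplicates(rows):
    """Group ids that share a normalized (name, joined) key."""
    groups = {}
    for r in rows:
        # Normalize case and whitespace so trivial variants collide.
        key = (" ".join(r["name"].lower().split()), r["joined"])
        groups.setdefault(key, []).append(r["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def find_outliers(rows, field="salary", cutoff=3.5):
    """Flag values whose modified z-score (based on the MAD) exceeds cutoff."""
    values = [r[field] for r in rows if r[field] is not None]
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [
        r["id"]
        for r in rows
        if r[field] is not None and 0.6745 * abs(r[field] - med) / mad > cutoff
    ]


def find_rule_violations(rows, min_age=18):
    """Hypothetical business rule: every known age must be at least min_age."""
    return [r["id"] for r in rows if r["age"] is not None and r["age"] < min_age]


report = {
    "missing": find_missing(records),
    "mixed_formats": find_mixed_formats(records),
    "duplicates": find_duplicates(records),
    "outliers": find_outliers(records),
    "rule_violations": find_rule_violations(records),
}
print(report)
```

On this toy input the sketch flags record 4 for a missing age, record 2 for a non-ISO date, records 1 and 3 as a likely replicated entry, record 6 as a salary outlier, and record 5 for violating the assumed age rule. It only detects errors; deciding how to repair them (for example, which duplicate to keep) is a separate step that this sketch does not attempt.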

Kaggle’s 2017 survey on the state of data science and machine learning found that dirty data is the most common barrier faced by people who work with data (Kaggle 2017). Not surprisingly, developing effective and efficient data cleaning solutions is challenging and is rich with deep theoretical and engineering problems.

There are various surveys and books on different aspects of data quality and data cleaning....



References

  1. Abedjan Z, Morcos J, Ilyas IF, Ouzzani M, Papotti P, Stonebraker M (2016) Dataxformer: a robust transformation discovery system. In: Proceedings of 32nd international conference on data engineering, pp 1134–1145
  2. Aggarwal CC (2013) Outlier analysis. Springer, New York
  3. Arasu A, Götz M, Kaushik R (2010) On active learning of record matching packages. In: Proceedings of ACM SIGMOD international conference on management of data, pp 783–794
  4. Bertossi LE (2011) Database repairing and consistent query answering. Morgan & Claypool Publishers, San Rafael
  5. Chawla S, Sun P (2006) Outlier detection: principles, techniques and applications. In: Advances in knowledge discovery and data mining, 10th Pacific-Asia conference
  6. Chu X, Ilyas IF, Papotti P (2013) Holistic data cleaning: putting violations into context. In: Proceedings of 29th international conference on data engineering, pp 458–469
  7. Dasu T, Johnson T (2003) Exploratory data mining and data cleaning. Wiley, Hoboken
  8. De Stefano C, Sansone C, Vento M (2000) To reject or not to reject: that is the question-an answer in case of neural classifiers. IEEE Trans Syst Man Cybern 30(1):84–94
  9. Elmagarmid AK, Ipeirotis PG, Verykios VS (2007) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19(1):1–16
  10. Fan W, Geerts F (2012) Foundations of data quality management. Synthesis lectures on data management. Morgan & Claypool Publishers, San Rafael
  11. Ganti V, Sarma AD (2013) Data cleaning: a practical perspective. Synth Lect Data Manag 5(3):1–85
  12. Grubbs FE (1969) Procedures for detecting outlying observations in samples. Technometrics 11(1):1–21
  13. Gulwani S (2011) Automating string processing in spreadsheets using input-output examples. In: Proceedings 38th ACM SIGACT-SIGPLAN symposium on principles of programming languages, pp 317–330
  14. Hawkins D (1980) Identification of outliers, vol 11. Chapman and Hall, London
  15. Hellerstein JM (2008) Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE)
  16. Hodge VJ, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22(2):85–126
  17. Ilyas IF, Chu X et al (2015) Trends in cleaning relational data: consistency and deduplication. Found Trends® Databases 5(4):281–393
  18. Kaggle (2017) https://goo.gl/ZAZGsD
  19. Kandel S, Paepcke A, Hellerstein J, Heer J (2011) Wrangler: interactive visual specification of data transformation scripts. In: Proceedings of SIGCHI conference on human factors in computing systems, pp 3363–3372
  20. Knorr EM, Ng RT (1998) Algorithms for mining distance-based outliers in large datasets. In: Proceedings of 24th international conference on very large data bases, pp 392–403
  21. Rahm E, Do HH (2000) Data cleaning: problems and current approaches. IEEE Data Eng Bull 23:2000
  22. Raman V, Hellerstein JM (2001) Potter’s wheel: an interactive data cleaning system. In: Proceedings of 27th international conference on very large data bases, pp 381–390
  23. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of 8th ACM SIGKDD international conference on knowledge discovery and data mining, pp 269–278
  24. Singh R, Gulwani S (2012) Learning semantic string transformations from examples. Proc VLDB Endow 5(8):740–751
  25. Tejada S, Knoblock CA, Minton S (2001) Learning object identification rules for information integration. Inf Syst 26(8):607–633

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. School of Computer Science, Georgia Institute of Technology, Atlanta, USA