Data Cleaning

Ganti, Venkatesh

doi:10.1007/978-1-4614-8265-9_592

Reference work entry
First Online: 01 January 2018

Definition

Data from external sources may not conform to the standards and requirements of the target data warehouse, either because the sources follow different conventions or because the data contains errors. Data therefore has to be transformed and cleaned before it is loaded into the warehouse so that downstream analysis is reliable and accurate. Data cleaning is the process of standardizing data representation and eliminating errors in data. It usually involves one or more tasks, each of which addresses part of the overall problem and is important in its own right. Besides the tasks that transform and modify data, diagnosing the quality of data in a database is also important. This diagnosis, often called data profiling, can usually identify data quality issues and indicate whether the data cleaning process is meeting its goals.
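
As an illustration only, the two activities named in the definition (standardizing representation and profiling a column) can be sketched in a few lines of Python. The entry itself does not prescribe any implementation; the `STATE_CODES` mapping, the function names `standardize_state` and `profile`, and the sample values are all hypothetical.

```python
from collections import Counter

# Hypothetical mapping from source conventions to one warehouse convention.
STATE_CODES = {
    "washington": "WA", "wash.": "WA", "wa": "WA",
    "california": "CA", "calif.": "CA", "ca": "CA",
}

def standardize_state(raw):
    """Return the warehouse code for a raw value, or None if unrecognized."""
    if raw is None:
        return None
    # Collapse whitespace and ignore case before the lookup.
    key = " ".join(raw.split()).lower()
    return STATE_CODES.get(key)

def profile(values):
    """Summarize a column: row count, missing count, distinct values, top values."""
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "top": Counter(present).most_common(3),
    }

# Made-up sample column mixing several source conventions and one bad value.
raw_states = ["Washington", " WA ", "wash.", "Calif.", "CA", "N/A", None]
cleaned = [standardize_state(s) for s in raw_states]

before = profile(raw_states)
after = profile(cleaned)
print(before)
print(after)
```

Comparing the two profiles shows the intent of profiling as a check on cleaning: the number of distinct values drops once the variants are mapped to one code, and the unrecognized value `"N/A"` surfaces as a missing value rather than passing through silently.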

Historical Background

Many...

Author information

Authors and Affiliations

Microsoft Research, Microsoft Corporation, Redmond, WA, USA
Venkatesh Ganti

Corresponding author

Correspondence to Venkatesh Ganti.

Editor information

Editors and Affiliations

Georgia Institute of Technology College of Computing, Atlanta, GA, USA
Ling Liu
University of Waterloo School of Computer Science, Waterloo, ON, Canada
M. Tamer Özsu

Section Editor information

Microsoft Research, Microsoft Corporation, Redmond, WA, USA
Venkatesh Ganti

Cite this entry

Ganti, V. (2018). Data Cleaning. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_592

DOI: https://doi.org/10.1007/978-1-4614-8265-9_592
Published: 07 December 2018
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering