Abstract
Data types and data structures are becoming increasingly complex as they keep pace with evolving technologies and applications. The result is an increase in the number and complexity of data quality problems. Data glitches, a common name for data quality problems, can be simple and stand alone, or highly complex with spatial and temporal correlations. In this chapter, we provide an overview of a comprehensive and measurable data quality process. To begin, we define and classify complex glitch types, and describe detection and cleaning techniques. We present metrics for assessing data quality and for choosing cleaning strategies subject to a variety of considerations. The process culminates in a “clean” data set that is acceptable to the end user. We conclude with an overview of significant literature in this area, and a discussion of opportunities for practice, application, and further research.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Dasu, T. (2013). Data Glitches: Monsters in Your Data. In: Sadiq, S. (ed.) Handbook of Data Quality. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36257-6_8
DOI: https://doi.org/10.1007/978-3-642-36257-6_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36256-9
Online ISBN: 978-3-642-36257-6
eBook Packages: Computer Science, Computer Science (R0)