Abstract
This study describes a statistically motivated approach to constraint-based data cleansing that derives the cause of errors from the distribution of conflicting tuples. In real-world dirty data, errors are often not randomly distributed. Rather, they tend to occur only under certain conditions, such as when a transaction is handled by a particular operator or when the weather is rainy. Leveraging such common conditions, or “cause conditions”, the algorithm resolves multi-tuple conflicts quickly and with high accuracy in realistic settings where the distribution of errors is skewed. We present complexity analyses of the problem, pointing out two subproblems that are NP-complete. We then introduce, for each subproblem, heuristics that work in sub-polynomial time. The algorithms are tested with three sets of data and rules. The experiments show that, compared to state-of-the-art methods for data cleansing based on Conditional Functional Dependencies (CFDs) and Functional Dependencies (FDs), the proposed algorithm scales better with respect to data size, is the only method that outputs complete repairs, and is more accurate when the error distribution is skewed.
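The idea of a cause condition can be illustrated with a small sketch. It is not the authors' algorithm; it only shows, under simplifying assumptions, how attribute values that are over-represented among tuples violating a functional dependency can be ranked. The function names (`conflicting_rows`, `rank_cause_conditions`), the lift-style score, and the toy `zip -> city` data are all hypothetical and introduced here for illustration.

```python
from collections import Counter, defaultdict


def conflicting_rows(rows, lhs, rhs):
    """Return indices of rows involved in a violation of the FD lhs -> rhs."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in lhs)].append(i)
    conflicts = set()
    for idxs in groups.values():
        # A group violates the FD if equal lhs values map to different rhs values.
        if len({rows[i][rhs] for i in idxs}) > 1:
            conflicts.update(idxs)
    return conflicts


def rank_cause_conditions(rows, lhs, rhs):
    """Rank (attribute, value) pairs by how over-represented they are in conflicts.

    Score (an assumption of this sketch): conflict rate among rows matching the
    condition, divided by the overall conflict rate.
    """
    conflicts = conflicting_rows(rows, lhs, rhs)
    if not conflicts:
        return []
    fd_attrs = set(lhs) | {rhs}
    total, in_conflict = Counter(), Counter()
    for i, row in enumerate(rows):
        for attr, value in row.items():
            if attr in fd_attrs:
                continue  # only attributes outside the FD are candidate conditions
            total[(attr, value)] += 1
            if i in conflicts:
                in_conflict[(attr, value)] += 1
    base_rate = len(conflicts) / len(rows)
    scored = [(c, (in_conflict[c] / total[c]) / base_rate) for c in total]
    return sorted(scored, key=lambda item: (-item[1], str(item[0])))


# Toy data (hypothetical): errors in "city" cluster under operator "B".
rows = [
    {"zip": "10001", "city": "NYC", "operator": "A"},
    {"zip": "10001", "city": "NYC", "operator": "A"},
    {"zip": "10001", "city": "Boston", "operator": "B"},
    {"zip": "02101", "city": "Boston", "operator": "A"},
    {"zip": "02101", "city": "Boston", "operator": "A"},
    {"zip": "60601", "city": "Chicago", "operator": "A"},
    {"zip": "60601", "city": "Chicgo", "operator": "B"},
    {"zip": "94105", "city": "SF", "operator": "A"},
]
ranked = rank_cause_conditions(rows, lhs=["zip"], rhs="city")
print(ranked)
```

In this toy data the condition `operator = B` ranks first, because every row handled by operator B takes part in a conflict while only half of operator A's rows do. A repair step could then prefer to change the values in tuples that match the top-ranked condition, which is the intuition the abstract describes.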
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Hoshino, A., Nakayama, H., Ito, C., Kanno, K., Nishimura, K. (2015). Leveraging the Common Cause of Errors for Constraint-Based Data Cleansing. In: Li, XL., Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D. (eds) Trends and Applications in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science, vol 9441. Springer, Cham. https://doi.org/10.1007/978-3-319-25660-3_14
DOI: https://doi.org/10.1007/978-3-319-25660-3_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25659-7
Online ISBN: 978-3-319-25660-3
eBook Packages: Computer Science, Computer Science (R0)