Leveraging the Common Cause of Errors for Constraint-Based Data Cleansing

Hoshino, Ayako; Nakayama, Hiroki; Ito, Chihiro; Kanno, Kyota; Nishimura, Kenshi

doi:10.1007/978-3-319-25660-3_14

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9441)

Abstract

This study describes a statistically motivated approach to constraint-based data cleansing that derives the cause of errors from a distribution of conflicting tuples. In real-world dirty data, errors are often not randomly distributed. Rather, they often occur only under certain conditions, such as when the transaction is handled by a certain operator, or the weather is rainy. Leveraging such common conditions, or “cause conditions”, the algorithm resolves multi-tuple conflicts with high speed, as well as high accuracy in realistic settings where the distribution of errors is skewed. We present complexity analyses of the problem, pointing out two subproblems that are NP-complete. We then introduce, for each subproblem, heuristics that work in sub-polynomial time. The algorithms are tested with three sets of data and rules. The experiments show that, compared to the state-of-the-art methods for Conditional Functional Dependencies (CFD)-based and FD-based data cleansing, the proposed algorithm scales better with respect to the data size, is the only method that outputs complete repairs, and is more accurate when the error distribution is skewed.
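The idea of a "cause condition" can be illustrated with a small sketch. The code below is not the authors' algorithm; it is a simplified illustration that assumes a single functional dependency (here the hypothetical rule zip -> city), collects the tuples that take part in a violation, and ranks attribute-value conditions by how over-represented they are among those conflicting tuples. The function names, the toy rows, and the lift-style score are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def conflicting_rows(rows, lhs, rhs):
    """Return indices of rows that take part in a violation of the FD lhs -> rhs."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in lhs)].append(i)
    conflicts = set()
    for members in groups.values():
        # A group violates the FD when rows that agree on lhs disagree on rhs.
        if len({rows[i][rhs] for i in members}) > 1:
            conflicts.update(members)
    return conflicts


def rank_cause_conditions(rows, lhs, rhs):
    """Rank (attribute, value) conditions by over-representation among conflicts.

    The score is an illustrative lift: P(conflict | condition) / P(conflict).
    """
    conflicts = conflicting_rows(rows, lhs, rhs)
    if not conflicts:
        return []
    base_rate = len(conflicts) / len(rows)
    fd_attrs = set(lhs) | {rhs}
    total, in_conflict = Counter(), Counter()
    for i, row in enumerate(rows):
        for attr, value in row.items():
            if attr in fd_attrs:
                continue  # only attributes outside the rule can act as a condition
            total[(attr, value)] += 1
            if i in conflicts:
                in_conflict[(attr, value)] += 1
    scored = [(in_conflict[c] / total[c] / base_rate, c) for c in total]
    return sorted(scored, key=lambda s: -s[0])


# Toy data (hypothetical): every conflict involves a record entered by "carol".
rows = [
    {"zip": "10001", "city": "New York", "operator": "alice"},
    {"zip": "10001", "city": "New York", "operator": "bob"},
    {"zip": "10001", "city": "Newark", "operator": "carol"},
    {"zip": "94105", "city": "San Francisco", "operator": "bob"},
    {"zip": "94105", "city": "San Francisco", "operator": "dave"},
    {"zip": "94105", "city": "San Jose", "operator": "carol"},
    {"zip": "60601", "city": "Chicago", "operator": "alice"},
    {"zip": "60601", "city": "Chicago", "operator": "dave"},
    {"zip": "60601", "city": "Chicago", "operator": "bob"},
]

ranked = rank_cause_conditions(rows, ["zip"], "city")
for score, condition in ranked:
    print(f"{condition}: lift {score:.2f}")
```

In this toy data the condition operator = carol ranks first, because both of her records fall in conflicting groups while the other operators are split between clean and conflicting groups. A repair step could then prefer changing the values in tuples that match the top-ranked condition, which is the intuition the abstract describes for skewed error distributions.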

Author information

Authors and Affiliations

Knowledge Discovery Research Laboratories, NEC Corporation, 1753, Shimonumabe Nakahara-ku, Kawasaki, Kanagawa, 211-8666, Japan
Ayako Hoshino & Kyota Kanno
System Integration, Services and Engineering Operations Unit, NEC Corporation, 1753, Shimonumabe Nakahara-ku, Kawasaki, Kanagawa, 211-8666, Japan
Chihiro Ito & Kenshi Nishimura
NEC Informatec Systems, Ltd., 1753, Shimonumabe Nakahara-ku, Kawasaki, Kanagawa, 211-8666, Japan
Hiroki Nakayama

Corresponding author

Correspondence to Ayako Hoshino.

Editor information

Editors and Affiliations

Institute of Infocomm Research, Singapore, Singapore
Xiao-Li Li
Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Tru Cao
School of Information Systems, Singapore Management University, Singapore, Singapore
Ee-Peng Lim
Nanjing University, Nanjing, China
Zhi-Hua Zhou
Japan Advanced Institute of Science and Technology, Nomi-shi, Ishikawa, Japan
Tu-Bao Ho
The University of Hong Kong, Hong Kong, China
David Cheung

About this paper

Cite this paper

Hoshino, A., Nakayama, H., Ito, C., Kanno, K., Nishimura, K. (2015). Leveraging the Common Cause of Errors for Constraint-Based Data Cleansing. In: Li, XL., Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D. (eds) Trends and Applications in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science, vol 9441. Springer, Cham. https://doi.org/10.1007/978-3-319-25660-3_14

DOI: https://doi.org/10.1007/978-3-319-25660-3_14
Published: 26 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25659-7
Online ISBN: 978-3-319-25660-3
eBook Packages: Computer Science, Computer Science (R0)
