Abstract
Conditional functional dependencies (CFDs) are a key technique for detecting inconsistencies, but they may miss some potential inconsistencies because they do not consider the content relationships among data. Content-related conditional functional dependencies (CCFDs) are a special type of CFDs that combine content-related CFDs and detect potential inconsistencies by putting content-related data together. In the process of cleaning inconsistencies, detection and repairing interact: 1) detection catches inconsistencies, and 2) repairing corrects the caught inconsistencies but may introduce new ones. In addition, data are often fragmented and distributed across multiple sites, so cleaning inconsistencies incurs expensive data shipment. In this paper, we aim to repair inconsistencies in distributed content-related data. We propose a framework consisting of an inconsistency detection method and an inconsistency repairing method, which work iteratively. The detection method marks the violated CCFDs in order to compute the inconsistencies that should be repaired first. Based on the repairing-cost model presented in this paper, we prove that minimum-cost repairing using CCFDs is NP-complete. Therefore, the repairing method repairs the inconsistencies heuristically, aiming at minimum cost. To improve the efficiency and accuracy of repairing, we propose distinct values and rules sequences. Distinct values require less data shipment than the real data when sites communicate. Rules sequences determine appropriate repairing orders to avoid some incorrect repairs. An empirical evaluation on two real-life datasets shows that our solution is more effective than CFDs.
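To make the detection step concrete, the following is a minimal sketch of how violations of a single plain CFD might be caught on one site. It is not the paper's CCFD detection method: the function name `cfd_violations`, the `"_"` wildcard convention, and the sample rows and attributes (`country`, `zip`, `city`) are all assumptions made up for illustration.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed marker for "any value" in a pattern tuple


def matches(value, pattern):
    """A value matches a pattern entry if the entry is a wildcard or equal."""
    return pattern == WILDCARD or value == pattern


def cfd_violations(rows, lhs, rhs, pattern):
    """Return sorted indices of rows that violate the CFD (lhs -> rhs, pattern).

    rows    : list of dicts, one per tuple (hypothetical in-memory relation)
    lhs     : list of left-hand-side attribute names
    rhs     : single right-hand-side attribute name
    pattern : dict giving a constant or WILDCARD for every attribute in lhs + [rhs]
    """
    violating = set()
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        # The CFD only applies to rows that match the pattern on the LHS.
        if not all(matches(row[a], pattern[a]) for a in lhs):
            continue
        # Single-tuple violation: the RHS pattern is a constant and the row disagrees.
        if pattern[rhs] != WILDCARD and row[rhs] != pattern[rhs]:
            violating.add(i)
        groups[tuple(row[a] for a in lhs)].append(i)
    # Pairwise violation: rows agree on the LHS but carry different RHS values.
    for members in groups.values():
        if len({rows[i][rhs] for i in members}) > 1:
            violating.update(members)
    return sorted(violating)


# Hypothetical sample data, not taken from the paper's datasets.
rows = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "London"},
    {"country": "US", "zip": "EH4", "city": "Boston"},
    {"country": "UK", "zip": "W1", "city": "London"},
]

# "Within the UK, zip determines city": variable pattern on zip and city.
variable_hits = cfd_violations(
    rows, ["country", "zip"], "city",
    {"country": "UK", "zip": WILDCARD, "city": WILDCARD},
)

# "UK tuples with zip W1 must have city Leeds": constant pattern (made up).
constant_hits = cfd_violations(
    rows, ["country", "zip"], "city",
    {"country": "UK", "zip": "W1", "city": "Leeds"},
)
```

Under these assumptions, `variable_hits` contains the two UK rows that share a zip but disagree on the city, and `constant_hits` contains the one row that contradicts the constant pattern. A repairing step would then have to decide which value to change, which is where a cost model and the ordering of rules come in.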
Cite this article
Du, YF., Shen, DR., Nie, TZ. et al. Content-Related Repairing of Inconsistencies in Distributed Data. J. Comput. Sci. Technol. 31, 741–758 (2016). https://doi.org/10.1007/s11390-016-1660-4
DOI: https://doi.org/10.1007/s11390-016-1660-4