Content-Related Repairing of Inconsistencies in Distributed Data

  • Regular Paper
  • Journal of Computer Science and Technology

Abstract

Conditional functional dependencies (CFDs) are a critical technique for detecting inconsistencies, but they may miss some potential inconsistencies because they do not consider the content relationships among data. Content-related conditional functional dependencies (CCFDs) are a special type of CFDs that combine content-related CFDs and detect potential inconsistencies by putting content-related data together. In the process of cleaning inconsistencies, detection and repairing interact: 1) detection catches inconsistencies, and 2) repairing corrects the caught inconsistencies but may introduce new ones. In addition, data are often fragmented and distributed across multiple sites, so cleaning inconsistencies incurs expensive data shipment. In this paper, we aim to repair inconsistencies in distributed content-related data. We propose a framework consisting of an inconsistency detection method and an inconsistency repairing method, which work iteratively. The detection method marks the violated CCFDs in order to compute which inconsistencies should be repaired first. Based on the repairing-cost model presented in this paper, we prove that minimum-cost repairing using CCFDs is NP-complete. Therefore, the repairing method repairs inconsistencies heuristically, aiming for minimum cost. To improve the efficiency and accuracy of repairing, we propose distinct values and rule sequences. Communicating distinct values requires less data shipment than communicating the real data. Rule sequences determine appropriate repairing orders to avoid some incorrect repairs. An empirical evaluation on two real-life datasets shows that our solution is more effective than CFDs.
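For readers unfamiliar with CFDs, the following minimal Python sketch illustrates what detecting a violation of an ordinary CFD on a single site can look like. It is not the CCFD detection or distributed repairing method of this paper, whose details are not given in the abstract. The function name `cfd_violations`, the attribute names, and the sample tuples are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def cfd_violations(rows, condition, lhs, rhs):
    """Return the LHS groups that violate the CFD (condition: lhs -> rhs).

    Hypothetical helper for illustration only: `condition` is a dict of
    attribute/value pairs that selects the tuples the rule applies to.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for i, row in enumerate(rows):
        # Only tuples matching the pattern condition are constrained.
        if all(row.get(a) == v for a, v in condition.items()):
            key = tuple(row[a] for a in lhs)
            groups[key][row[rhs]].append(i)
    # A violation is an LHS group whose tuples disagree on the RHS value.
    return {k: dict(v) for k, v in groups.items() if len(v) > 1}

# Made-up sample data: within the UK, zip is assumed to determine street.
rows = [
    {"country": "UK", "zip": "EH4 1DT", "street": "Mayfield"},
    {"country": "UK", "zip": "EH4 1DT", "street": "Crichton"},
    {"country": "US", "zip": "07974", "street": "Mountain Ave"},
    {"country": "US", "zip": "07974", "street": "Main St"},
]
violations = cfd_violations(rows, {"country": "UK"}, ["zip"], "street")
```

Here the two UK tuples share a zip but disagree on street, so they are reported as a violation. The two US tuples also disagree, but they fall outside the condition and are not checked, which is what distinguishes a CFD from a plain functional dependency.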

Author information

Corresponding author

Correspondence to Yue-Feng Du.

About this article

Cite this article

Du, YF., Shen, DR., Nie, TZ. et al. Content-Related Repairing of Inconsistencies in Distributed Data. J. Comput. Sci. Technol. 31, 741–758 (2016). https://doi.org/10.1007/s11390-016-1660-4

  • DOI: https://doi.org/10.1007/s11390-016-1660-4
