One-Pass Inconsistency Detection Algorithms for Big Data

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9642)

Abstract

Real-world data is often dirty, and inconsistency is an important kind of dirty data. Before inconsistencies can be repaired, they must first be detected. The time complexity of current inconsistency detection algorithms is super-linear in the size of the data, which makes them unsuitable for big data. To address this, we develop an algorithm that detects inconsistencies in a single pass over the data, with respect to both functional dependencies (FDs) and conditional functional dependencies (CFDs). We compare our detection algorithm with existing approaches experimentally. Experimental results on real datasets show that our approach detects inconsistencies effectively and efficiently.
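
To make the idea of single-scan detection concrete, the following is a minimal sketch of a generic hash-based check, not the algorithm developed in the paper, whose details are not given in this abstract. It assumes rows are dictionaries, an FD is given as two attribute lists, and a CFD adds a pattern tuple in which `"_"` is a wildcard. The function names, the wildcard symbol, and the sample attributes (`cc`, `zip`, `city`) are all hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Mapping, Sequence, Tuple

Row = Mapping[str, Hashable]
WILDCARD = "_"  # assumed symbol for "any value" in a CFD pattern


def detect_fd_violations(
    rows: Iterable[Row], lhs: Sequence[str], rhs: Sequence[str]
) -> List[Tuple[int, int]]:
    """Scan rows once; report (first_row, conflicting_row) index pairs for lhs -> rhs."""
    first_seen: Dict[tuple, Tuple[int, tuple]] = {}  # LHS value -> (row index, RHS value)
    violations: List[Tuple[int, int]] = []
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key not in first_seen:
            first_seen[key] = (i, val)
        elif first_seen[key][1] != val:
            # Same LHS value, different RHS value: the pair violates the FD.
            violations.append((first_seen[key][0], i))
    return violations


def matches(row: Row, attrs: Sequence[str], pattern: Mapping[str, Hashable]) -> bool:
    """True if the row agrees with every non-wildcard constant of the pattern on attrs."""
    return all(pattern[a] == WILDCARD or row[a] == pattern[a] for a in attrs)


def detect_cfd_violations(
    rows: Iterable[Row],
    lhs: Sequence[str],
    rhs: Sequence[str],
    pattern: Mapping[str, Hashable],
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Scan rows once; return (single-row violations, pairwise violations) for a CFD."""
    first_seen: Dict[tuple, Tuple[int, tuple]] = {}
    single: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for i, row in enumerate(rows):
        if not matches(row, lhs, pattern):
            continue  # the CFD only constrains rows that match the LHS pattern
        if not matches(row, rhs, pattern):
            single.append(i)  # row contradicts a constant on the RHS
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key not in first_seen:
            first_seen[key] = (i, val)
        elif first_seen[key][1] != val:
            pairs.append((first_seen[key][0], i))
    return single, pairs


rows = [
    {"cc": "44", "zip": "EH4", "city": "EDI"},
    {"cc": "44", "zip": "EH4", "city": "LDN"},
    {"cc": "01", "zip": "908", "city": "MH"},
    {"cc": "01", "zip": "908", "city": "MH"},
]
fd_result = detect_fd_violations(rows, ["cc", "zip"], ["city"])
cfd_result = detect_cfd_violations(
    rows, ["cc", "zip"], ["city"], {"cc": "01", "zip": "908", "city": "NYC"}
)
```

In this sketch each row is read once and compared only against the first row seen with the same left-hand-side value, so the expected running time is linear in the number of rows, at the cost of memory proportional to the number of distinct left-hand-side values.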

Notes

  1. The description of this dataset can be found at the following website: http://archive.ics.uci.edu/ml/machine-learning-databases/census1990-mld/USCensus1990-desc.html

Acknowledgment

This work was supported by NGFR 973 grant 2012CB316200, NSFC grants U1509216, 61472099, and 61133002, and National Sci-Tech Support Plan 2015 BAH10F01.

Author information

Corresponding author

Correspondence to Hongzhi Wang.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhang, M., Wang, H., Li, J., Gao, H. (2016). One-Pass Inconsistency Detection Algorithms for Big Data. In: Navathe, S., Wu, W., Shekhar, S., Du, X., Wang, X., Xiong, H. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9642. Springer, Cham. https://doi.org/10.1007/978-3-319-32025-0_6

  • DOI: https://doi.org/10.1007/978-3-319-32025-0_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-32024-3

  • Online ISBN: 978-3-319-32025-0

  • eBook Packages: Computer Science, Computer Science (R0)
