Big Data Cleaning

  • Conference paper
Web Technologies and Applications (APWeb 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8709)

Abstract

Data cleaning has long played an important part in data management and data analytics, and it is still undergoing rapid development. It is also considered a main challenge in the era of big data, owing to the increasing volume, velocity and variety of data in many applications. This paper provides an overview of recent work on different aspects of data cleaning: error detection methods, data repairing algorithms, and a generalized data cleaning system. It also discusses our efforts on data cleaning methods from the perspective of big data, in terms of volume, velocity and variety.

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Tang, N. (2014). Big Data Cleaning. In: Chen, L., Jia, Y., Sellis, T., Liu, G. (eds) Web Technologies and Applications. APWeb 2014. Lecture Notes in Computer Science, vol 8709. Springer, Cham. https://doi.org/10.1007/978-3-319-11116-2_2

  • DOI: https://doi.org/10.1007/978-3-319-11116-2_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11115-5

  • Online ISBN: 978-3-319-11116-2

  • eBook Packages: Computer Science, Computer Science (R0)
