Support for User Involvement in Data Cleaning

  • Conference paper
Data Warehousing and Knowledge Discovery (DaWaK 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6862)

Abstract

Data cleaning and ETL processes are usually modeled as graphs of data transformations. The users responsible for executing these graphs over real data need to be involved, both to tune the data transformations and to manually correct data items that cannot be treated automatically. In this paper, to better support user involvement in data cleaning processes, we equip a data cleaning graph with two elements: data quality constraints, which help users identify the points of the graph and the records that need their attention, and manual data repairs, which represent the way users can provide the feedback required to manually clean some data items. We provide preliminary experimental results that show the significant gains obtained with the use of data cleaning graphs.
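
As an illustration only, the idea of a cleaning graph annotated with quality constraints and manual data repairs can be sketched in Python as follows. The sketch simplifies the graph to a linear chain of transformations, and every name in it (`Node`, `Pipeline`, `repairs`, `normalize_year`) as well as the sample records are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


@dataclass
class Node:
    """One transformation plus a quality constraint checked on its output."""
    name: str
    transform: Callable[[Record], Record]
    constraint: Callable[[Record], bool]


@dataclass
class Pipeline:
    """Assumption: a linear chain stands in for a general cleaning graph."""
    nodes: List[Node]
    # Manual repairs: (node name, record id) -> field values supplied by a user.
    repairs: Dict[Tuple[str, str], Record] = field(default_factory=dict)

    def run(self, records: List[Record]) -> Tuple[List[Record], List[Tuple[str, str]]]:
        flagged = []  # (node name, record id) pairs that need user attention
        for node in self.nodes:
            out = []
            for rec in records:
                rec = node.transform(dict(rec))  # copy so the input is untouched
                # Apply any user-provided repair for this record at this node.
                rec.update(self.repairs.get((node.name, rec["id"]), {}))
                if not node.constraint(rec):
                    flagged.append((node.name, rec["id"]))
                out.append(rec)
            records = out
        return records, flagged


def normalize_year(rec: Record) -> Record:
    rec["year"] = rec["year"].strip()
    return rec


pipeline = Pipeline(
    nodes=[Node("normalize_year", normalize_year, lambda r: r["year"].isdigit())]
)
data = [{"id": "1", "year": " 2011 "}, {"id": "2", "year": "n/a"}]

# First run: the constraint points the user at the record that needs attention.
cleaned, flagged_before = pipeline.run(data)

# The user supplies a manual repair for that record, then the chain is rerun.
pipeline.repairs[("normalize_year", "2")] = {"year": "2009"}
cleaned, flagged_after = pipeline.run(data)
```

In this sketch the first run flags record 2 at the `normalize_year` node; once a repair is registered for that record, the second run flags nothing.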

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Galhardas, H., Lopes, A., Santos, E. (2011). Support for User Involvement in Data Cleaning. In: Cuzzocrea, A., Dayal, U. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2011. Lecture Notes in Computer Science, vol 6862. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23544-3_11

  • DOI: https://doi.org/10.1007/978-3-642-23544-3_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23543-6

  • Online ISBN: 978-3-642-23544-3

  • eBook Packages: Computer Science, Computer Science (R0)
