Advertisement

Cleaning data with Llunatic

  • Floris Geerts
  • Giansalvatore Mecca
  • Paolo PapottiEmail author
  • Donatello Santoro
Regular Paper

Abstract

Data cleaning (or data repairing) is considered a crucial problem in many database-related tasks. It consists in making a database consistent with respect to a given set of constraints. In recent years, repairing methods have been proposed for several classes of constraints. These methods, however, tend to hard-code the strategy to repair conflicting values and are specialized toward specific classes of constraints. In this paper, we develop a general chase-based repairing framework, referred to as Llunatic, in which repairs can be obtained for a large class of constraints and by using different strategies to select preferred values. The framework is based on an elegant formalization in terms of labeled instances and partially ordered preference labels. In this context, we revisit concepts such as upgrades, repairs and the chase. In Llunatic, various repairing strategies can be slotted in, without the need for changing the underlying implementation. Furthermore, Llunatic is the first data repairing system which is DBMS-based. We report experimental results that confirm its good scalability and show that various instantiations of the framework result in repairs of good quality.

Keywords

Data quality Data cleaning Data repairing Chase Data repairing system Constraints Rules Repair algorithm Cleaning rules Dependencies Error detection 

Notes

Funding

Paolo Papotti has been partially supported by Agence Nationale de la Recherche (Grant No. ANR-18-CE23-0019).

References

  1. 1.
    Abedjan, Z., Chu, X., Deng, D., Fernandez, R.C., Ilyas, I.F., Ouzzani, M., Papotti, P., Stonebraker, M., Tang, N.: Detecting data errors: Where are we and what needs to be done? PVLDB 9(12), 993–1004 (2016)Google Scholar
  2. 2.
    Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Boston (1995)zbMATHGoogle Scholar
  3. 3.
    Arocena, P.C., Glavic, B., Mecca, G., Miller, R.J., Papotti, P., Santoro, D.: Messing up with BART: error generation for evaluating data-cleaning algorithms. PVLDB 9(2), 36–47 (2015)Google Scholar
  4. 4.
    Beeri, C., Vardi, M.: A proof procedure for data dependencies. J. ACM 31(4), 718–741 (1984)MathSciNetCrossRefGoogle Scholar
  5. 5.
    Benedikt, M., Konstantinidis, G., Mecca, G., Motik, B., Papotti, P., Santoro, D., Tsamoura, E.: Benchmarking the chase. In: PODS, pp. 37–52 (2017)Google Scholar
  6. 6.
    Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)CrossRefGoogle Scholar
  7. 7.
    Bertossi, L.: Database Repairing and Consistent Query Answering. Morgan & Claypool, San Rafael (2011)CrossRefGoogle Scholar
  8. 8.
    Bertossi, L., Kolahi, S., Lakshmanan, L.: Data cleaning and query answering with matching dependencies and matching functions. In: ICDT, pp. 268–279 (2011)Google Scholar
  9. 9.
    Beskales, G., Ilyas, I.F., Golab, L.: Sampling the repairs of functional dependency violations under hard constraints. PVLDB 3, 197–207 (2010)Google Scholar
  10. 10.
    Bleifuß, T., Kruse, S., Naumann, F.: Efficient denial constraint discovery with Hydra. Proc. VLDB Endow. 11(3), 311–323 (2017)CrossRefGoogle Scholar
  11. 11.
    Bohannon, P., Flaster, M., Fan, W., Rastogi, R.: A cost-based model and effective heuristic for repairing constraints by value modification. In: SIGMOD, pp. 143–154 (2005)Google Scholar
  12. 12.
    Cao, Y., Fan, W., Yu, W.: Determining the relative accuracy of attributes. In: SIGMOD, pp. 565–576 (2013)Google Scholar
  13. 13.
    Caroprese, L., Greco, S., Zumpano, E.: Active integrity constraints for database consistency maintenance. IEEE Trans. Knowl. Data Eng. 21(7), 1042–1058 (2009)CrossRefGoogle Scholar
  14. 14.
    Chiang, F., Miller, R.J.: A unified model for data and constraint repair. In: ICDE (2011)Google Scholar
  15. 15.
    Chu, X., Ilyas, I.F., Krishnan, S., Wang, J.: Data cleaning: Overview and emerging challenges. In: SIGMOD, pp. 2201–2206 (2016)Google Scholar
  16. 16.
    Chu, X., Ilyas, I.F., Papotti, P.: Holistic data cleaning: putting violations into context. In: ICDE, pp. 458–469 (2013)Google Scholar
  17. 17.
    Chu, X., Morcos, J., Ilyas, I.F., Ouzzani, M., Papotti, P., Tang, N., Ye, Y.: KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In: SIGMOD, pp. 1247–1261 (2015)CrossRefGoogle Scholar
  18. 18.
    Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality: consistency and accuracy. In: VLDB, pp. 315–326 (2007)Google Scholar
  19. 19.
    Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A.K., Ilyas, I., Ouzzani, M., Tang, N.: Nadeef: a commodity data cleaning system. In: SIGMOD, pp. 541–552 (2013)Google Scholar
  20. 20.
    Deng, D., Tao, W., Abedjan, Z., Elmagarmid, A.K., Ilyas, I.F., Madden, S., Ouzzani, M., Stonebraker, M., Tang, N.: Entity consolidation: the golden record problem. CoRR arXiv:1709.10436 (2017)
  21. 21.
    Experian: White paper: The data quality benchmark report (2015)Google Scholar
  22. 22.
    Fagin, R., Kolaitis, P., Miller, R., Popa, L.: Data exchange: semantics and query answering. TCS 336(1), 89–124 (2005)MathSciNetCrossRefGoogle Scholar
  23. 23.
    Fan, W., Gao, H., Jia, X., Li, J., Ma, S.: Dynamic constraints for record matching. VLDB J. 20(4), 495–520 (2011)CrossRefGoogle Scholar
  24. 24.
    Fan, W., Geerts, F.: Foundations of Data Quality Management. Morgan & Claypool, San Rafael (2012)CrossRefGoogle Scholar
  25. 25.
    Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for capturing data inconsistencies. ACM TODS 33, 6 (2008)Google Scholar
  26. 26.
    Fan, W., Geerts, F., Li, J., Xiong, M.: Discovering conditional functional dependencies. IEEE Trans. Knowl. Data Eng. 23(5), 683–698 (2011)CrossRefGoogle Scholar
  27. 27.
    Fan, W., Geerts, F., Ma, S., Müller, H.: Detecting inconsistencies in distributed data. In: Proceedings of the 26th International Conference on Data Engineering, ICDE, pp. 64–75 (2010)Google Scholar
  28. 28.
    Fan, W., Geerts, F., Wijsen, J.: Determining the currency of data. In: PODS, pp. 71–82 (2011)Google Scholar
  29. 29.
    Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Towards certain fixes with editing rules and master data. PVLDB 3(1), 173–184 (2010)Google Scholar
  30. 30.
    Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Interaction between record matching and data repairing. In: SIGMOD, pp. 469–480 (2011)Google Scholar
  31. 31.
    Geerts, F., Mecca, G., Papotti, P., Santoro, D.: The llunatic data-cleaning framework. PVLDB 6(9), 625–636 (2013)Google Scholar
  32. 32.
    Geerts, F., Mecca, G., Papotti, P., Santoro, D.: Mapping and cleaning. In: ICDE, pp. 232–243 (2014)Google Scholar
  33. 33.
    He, J., Veltri, E., Santoro, D., Li, G., Mecca, G., Papotti, P., Tang, N.: Interactive and deterministic data cleaning. In: SIGMOD, pp. 893–907 (2016)Google Scholar
  34. 34.
    Hernández, M., Koutrika, G., Krishnamurthy, R., Popa, L., Wisnesky, R.: Hil: a high-level scripting language for entity integration. In: EDBT, pp. 549–560 (2013)Google Scholar
  35. 35.
    Huhtala, Y., Kärkkäinen, J., Pasi Porkka, P., Toivonen, H.: TANE: an efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42(2), 100–111 (1999)CrossRefGoogle Scholar
  36. 36.
    Ilyas, I.F.: Effective data cleaning with continuous evaluation. IEEE Data Eng. Bull. 39(2), 38–46 (2016)Google Scholar
  37. 37.
    Ilyas, I.F., Chu, X.: Trends in cleaning relational data: consistency and deduplication. Found. Trends Databases 5(4), 281–393 (2015)CrossRefGoogle Scholar
  38. 38.
    Imieliński, T., Lipski, W.: Incomplete information in relational databases. J. ACM 31(4), 761–791 (1984)MathSciNetCrossRefGoogle Scholar
  39. 39.
    Khayyat, Z., Ilyas, I.F., Jindal, A., Madden, S., Ouzzani, M., Papotti, P., Quiané-Ruiz, J., Tang, N., Yin, S.: Bigdansing: a system for big data cleansing. In: SIGMOD, pp. 1215–1230 (2015)Google Scholar
  40. 40.
    Kimelfeld, B., Livshits, E., Peterfreund, L.: Detecting ambiguity in prioritized database repairing. In: ICDT, pp. 17:1–17:20 (2017)Google Scholar
  41. 41.
    Kolahi, S., Lakshmanan, L.V.S.: On approximating optimum repairs for functional dependency violations. In: ICDT, pp. 53–62 (2009)Google Scholar
  42. 42.
    Koudas, N., Saha, A., Srivastava, D., Venkatasubramanian, S.: Metric functional dependencies. In: ICDE, pp. 1275–1278 (2009)Google Scholar
  43. 43.
    Loshin, D.: Master Data Management. Knowl. Integrity, Inc., Washington, DC (2009)zbMATHGoogle Scholar
  44. 44.
    Marnette, B., Mecca, G., Papotti, P., Raunich, S., Santoro, D.: ++Spicy: an opensource tool for second-generation schema mapping and data exchange. PVLDB 4(12), 1438–1441 (2011)Google Scholar
  45. 45.
    Papenbrock, T., Naumann, F.: A hybrid approach to functional dependency discovery. In: SIGMOD, pp. 821–833 (2016)Google Scholar
  46. 46.
    Rammelaere, J., Geerts, F.: Revisiting conditional functional dependency discovery: splitting the “c” from the “fd”. In: ECML/PKDD, pp. 552–568 (2018)CrossRefGoogle Scholar
  47. 47.
    Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: Holoclean: holistic data repairs with probabilistic inference. PVLDB 10(11), 1190–1201 (2017)Google Scholar
  48. 48.
    Saha, B., Srivastava, D.: Data quality: the other face of big data. In: ICDE, pp. 1294–1297 (2014)Google Scholar
  49. 49.
    Song, S., Chen, L.: Differential dependencies: reasoning and discovery. ACM Trans. Database Syst. 36(3), 16 (2011)CrossRefGoogle Scholar
  50. 50.
    Staworko, S., Chomicki, J., Marcinkowski, J.: Prioritized repairing and consistent query answering in relational databases. Ann. Math. Artif. Intell. 64(2–3), 209–246 (2012)MathSciNetCrossRefGoogle Scholar
  51. 51.
    Volkovs, M., Chiang, F., Szlichta, J., Miller, R.J.: Continuous data cleaning. In: ICDE (2014)Google Scholar
  52. 52.
    Wang, J., Tang, N.: Towards dependable data repairing with fixing rules. In: SIGMOD (2014)Google Scholar
  53. 53.
    Wijsen, J.: Database repairing using updates. ACM Trans. Database Syst. 30(3), 722–768 (2005)CrossRefGoogle Scholar
  54. 54.
    Yakout, M., Berti-Équille, L., Elmagarmid, A.K.: Don’t be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In: SIGMOD, pp. 553–564 (2013)Google Scholar
  55. 55.
    Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M., Ilyas, I.F.: Guided data repair. PVLDB 4(5), 279–289 (2011)Google Scholar

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. 1.University of AntwerpAntwerpBelgium
  2. 2.Università della BasilicataPotenzaItaly
  3. 3.EurecomBiotFrance

Personalised recommendations