The VLDB Journal

, Volume 27, Issue 4, pp 497–519 | Cite as

Distilling relations using knowledge bases

  • Shuang Hao
  • Nan Tang
  • Guoliang LiEmail author
  • Jian Li
  • Jianhua Feng
Regular Paper


Given a relational table, we study the problem of detecting and repairing erroneous data, as well as marking correct data, using well curated knowledge bases (KBs). We propose detective rules (DRs), a new type of data cleaning rules that can make actionable decisions on relational data, by building connections between a relation and a KB. The main invention is that a DR simultaneously models two opposite semantics of an attribute belonging to a relation using types and relationships in a KB: The positive semantics explains how its value should be linked to other attribute values in a correct tuple, and the negative semantics indicate how a wrong attribute value is connected to other correct attribute values within the same tuple. Naturally, a DR can mark correct values in a tuple if it matches the positive semantics. Meanwhile, a DR can detect/repair an error if it matches the negative semantics. We study fundamental problems associated with DRs, e.g., rule consistency and rule implication. We present efficient algorithms to apply DRs to clean a relation, based on rule order selection and inverted indexes. Moreover, we discuss approaches on how to generate DRs from examples. Extensive experiments, using both real-world and synthetic datasets, verify the effectiveness and efficiency of applying DRs in practice.


Data cleaning Knowledge base Detective rule Rule generation 



This work was supported by the 973 Program of China (2015CB358700), NSF of China (61632016, 61472198, 61521002, 61661166012), and TAL education.


  1. 1.
    Abedjan, Z., Chu, X., Deng, D., Fernandez, R.C., Ilyas, I.F., Ouzzani, M., Papotti, P., Stonebraker, M., Tang, N.: Detecting data errors: where are we and what needs to be done? PVLDB 9(12), 993–1004 (2016)Google Scholar
  2. 2.
    Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Boston (1995)zbMATHGoogle Scholar
  3. 3.
    Anchuri, P., Zaki, M.J., Barkol, O., Golan, S., Shamy, M.: Approximate graph mining with label costs. In: KDD, pp. 518–526 (2013)Google Scholar
  4. 4.
    Arenas, M., Bertossi, L.E., Chomicki, J.: Consistent query answers in inconsistent databases. In: SIGMOD, pp. 68–79. ACM (1999)Google Scholar
  5. 5.
    Bach, S.H., Broecheler, M., Huang, B., Getoor, L.: Hinge-loss markov random fields and probabilistic soft logic. CoRR, arXiv:1505.04406 (2015)
  6. 6.
    Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)CrossRefGoogle Scholar
  7. 7.
    Bohannon, P., Fan, W., Flaster, M., Rastogi, R.: A cost-based model and effective heuristic for repairing constraints by value modification. In: SIGMOD (2005)Google Scholar
  8. 8.
    Chai, C., Li, G., Li, J., Deng, D., Feng, J.: Cost-effective crowdsourced entity resolution: a partial-order approach. In: SIGMOD, pp. 969–984 (2016)Google Scholar
  9. 9.
    Chiang, F., Miller, R.J.: A unified model for data and constraint repair. In: ICDE (2011)Google Scholar
  10. 10.
    Chu, X., Ilyas, I.F., Papotti, P.: Holistic data cleaning: putting violations into context. In: ICDE (2013)Google Scholar
  11. 11.
    Chu, X., Morcos, J., Ilyas, I.F., Ouzzani, M., Papotti, P., Tang, N., Ye, Y.: KATARA: a data cleaning system powered by knowledge bases and crowdsourcing. In: SIGMOD (2015)Google Scholar
  12. 12.
    Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality: consistency and accuracy. In: VLDB (2007)Google Scholar
  13. 13.
    Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A.K., Ilyas, I.F., Ouzzani, M., Tang, N.: NADEEF: a commodity data cleaning system. In: SIGMOD (2013)Google Scholar
  14. 14.
    Deng, D., Jiang, Y., Li, G., Li, J., Yu, C.: Scalable column concept determination for web tables using large knowledge bases. PVLDB 6(13), 1606–1617 (2013)Google Scholar
  15. 15.
    Deng, D., Li, G., Wen, H., Feng, J.: An efficient partition based method for exact set similarity joins. PVLDB 9(4), 360–371 (2015)Google Scholar
  16. 16.
    Deshpande, O., Lamba, D.S., Tourn, M., Das, S., Subramaniam, S., Rajaraman, A., Harinarayan, V., Doan, A.: Building, maintaining, and using knowledge bases: a report from the trenches. In: SIGMOD Conference (2013)Google Scholar
  17. 17.
    Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., Strohmann, T., Sun, S., Zhang, W.: Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: SIGKDD (2014)Google Scholar
  18. 18.
    Dong, X.L., Gabrilovich, E., Heitz, G., Horn, W., Murphy, K., Sun, S., Zhang, W.: From data fusion to knowledge fusion. PVLDB 7(10), 881–892 (2014)Google Scholar
  19. 19.
    Fan, W.: Dependencies revisited for improving data quality. In: PODS (2008)Google Scholar
  20. 20.
    Fan, W., Fan, Z., Tian, C., Dong, X.L.: Keys for graphs. PVLDB 8(12), 1590–1601 (2015)Google Scholar
  21. 21.
    Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for capturing data inconsistencies. TODS 33(2), 6 (2008)CrossRefGoogle Scholar
  22. 22.
    Fan, W., Jia, X., Li, J., Ma, S.: Reasoning about record matching rules. PVLDB 2(1), 407–418 (2009)Google Scholar
  23. 23.
    Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Towards certain fixes with editing rules and master data. VLDB J. 21(2), 213–238 (2012)CrossRefGoogle Scholar
  24. 24.
    Feng, J., Wang, J., Li, G.: Trie-join: a trie-based method for efficient string similarity joins. VLDB J. 21(4), 437–461 (2012)CrossRefGoogle Scholar
  25. 25.
    Geerts, F., Mecca, G., Papotti, P., Santoro, D.: The LLUNATIC data-cleaning framework. PVLDB 6(9), 625–636 (2013)Google Scholar
  26. 26.
    Hao, S., Tang, N., Li, G., Li, J.: Cleaning relations using knowledge bases. In: ICDE (2017)Google Scholar
  27. 27.
    He, J., Veltri, E., Santoro, D., Li, G., Mecca, G., Papotti, P., Tang, N.: Interactive and deterministic data cleaning. In: SIGMOD (2016)Google Scholar
  28. 28.
    Heer, J., Hellerstein, J.M., Kandel, S.: Predictive interaction for data transformation. In: CIDR (2015)Google Scholar
  29. 29.
    Herzog, T.N., Scheuren, F.J., Winkler, W.E.: Data Quality and Record Linkage Techniques. Springer, Berlin (2009)zbMATHGoogle Scholar
  30. 30.
    Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: YAGO2: A spatially and temporally enhanced knowledge base from wikipedia. Artif. Intell. 194, 28–61 (2013)MathSciNetCrossRefzbMATHGoogle Scholar
  31. 31.
    Interlandi, M., Tang, N.: Proof positive and negative in data cleaning. In: ICDE (2015)Google Scholar
  32. 32.
    Jiang, Y., Li, G., Feng, J., Li, W.: String similarity joins: an experimental evaluation. PVLDB 7(8), 625–636 (2014)Google Scholar
  33. 33.
    Khayyat, Z., Ilyas, I.F., Jindal, A., Madden, S., Ouzzani, M., Papotti, P., Quiané-Ruiz, J.-A., Tang, N., Yin, S.: Bigdansing: a system for big data cleansing. In: SIGMOD (2015)Google Scholar
  34. 34.
    Li, G.: A human-machine method for web table understanding. In: WAIM, pp. 179–189 (2013)Google Scholar
  35. 35.
    Li, G.: Human-in-the-loop data integration. PVLDB 10(12), 2006–2017 (2017)Google Scholar
  36. 36.
    Li, G., Chai, C., Fan, J., Weng, X., Li, J., Zheng, Y., Li, Y., Yu, X., Zhang, X., Yuan, H.: CDB: optimizing queries with crowd-based selections and joins. In: SIGMOD, pp. 1463–1478 (2017)Google Scholar
  37. 37.
    Li, G., Deng, D., Wang, J., Feng, J.: PASS-JOIN: a partition-based method for similarity joins. PVLDB 5(3), 253–264 (2011)Google Scholar
  38. 38.
    Li, G., Ooi, B.C., Feng, J., Wang, J., Zhou, L.: EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In: SIGMOD, pp. 903–914 (2008)Google Scholar
  39. 39.
    Li, G., Wang, J., Zheng, Y., Franklin, M.J.: Crowdsourced data management: a survey. IEEE Trans. Knowl. Data Eng. 28(9), 2296–2319 (2016)CrossRefGoogle Scholar
  40. 40.
    Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. PVLDB 3(12), 1338–1347 (2010)Google Scholar
  41. 41.
    Morsey, M., Lehmann, J., Auer, S., Ngomo, A.N.: Dbpedia SPARQL benchmark—performance assessment with real queries on real data. In: ISWC (2011)Google Scholar
  42. 42.
    Niu, F., Ré, C., Doan, A., Shavlik, J.W.: Tuffy: Scaling up statistical inference in markov logic networks using an RDBMS. PVLDB 4(6), 373–384 (2011)Google Scholar
  43. 43.
    Raman, V., Hellerstein, J.M.: Potter’s wheel: an interactive data cleaning system. In: VLDB (2001)Google Scholar
  44. 44.
    Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: Holoclean Holistic data repairs with probabilistic inference. PVLDB 10(11), 1190–1201 (2017)Google Scholar
  45. 45.
    Shang, Z., Liu, Y., Li, G., Feng, J.: K-join: knowledge-aware similarity join. IEEE Trans. Knowl. Data Eng. 28(12), 3293–3308 (2016)CrossRefGoogle Scholar
  46. 46.
    Shin, J., Wu, S., Wang, F., Sa, C.D., Zhang, C., Ré, C.: Incremental knowledge base construction using deepdive. PVLDB 8(11), 1310–1321 (2015)Google Scholar
  47. 47.
    Singh, R., Meduri, V., Elmagarmid, A.K., Madden, S., Papotti, P., Quiané-Ruiz, J., Solar-Lezama, A., Tang, N.: Generating concise entity matching rules. In: PVLDB (2017)Google Scholar
  48. 48.
    Singh, R., Meduri, V., Elmagarmid, A.K., Madden, S., Papotti, P., Quiané-Ruiz, J., Solar-Lezama, A., Tang, N.: Synthesizing entity matching rules by examples. In: SIGMOD demo (2017)Google Scholar
  49. 49.
    Song, S., Cheng, H., Yu, J.X., Chen, L.: Repairing vertex labels under neighborhood constraints. PVLDB 7(11), 987–998 (2014)Google Scholar
  50. 50.
    Venetis, P., Halevy, A.Y., Madhavan, J., Pasca, M., Shen, W., Wu, F., Miao, G., Wu, C.: Recovering semantics of tables on the web. PVLDB 4(9), 528–538 (2011)Google Scholar
  51. 51.
    Volkovs, M., Chiang, F., Szlichta, J., Miller, R.J.: Continuous data cleaning. In: ICDE (2014)Google Scholar
  52. 52.
    Wang, J., Li, G., Feng, J.: Trie-join: efficient trie-based string similarity joins with edit-distance constraints. PVLDB 3(1), 1219–1230 (2010)Google Scholar
  53. 53.
    Wang, J., Li, G., Feng, J.: Fast-join: an efficient method for fuzzy token matching based string similarity join. In: Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11–16, 2011, Hannover, Germany, pp. 458–469 (2011)Google Scholar
  54. 54.
    Wang, J., Li, G., Kraska, T., Franklin, M.J., Feng, J.: Leveraging transitive relations for crowdsourced joins. In: SIGMOD, pp. 229–240 (2013)Google Scholar
  55. 55.
    Wang, J., Tang, N.: Towards dependable data repairing with fixing rules. In: SIGMOD (2014)Google Scholar
  56. 56.
    Yakout, M., Berti-Equille, L., Elmagarmid, A.K.: Don’t be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In: SIGMOD (2013)Google Scholar
  57. 57.
    Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M., Ilyas, I.F.: Guided data repair. PVLDB 4(5), 279–289 (2011)Google Scholar
  58. 58.
    Yu, M., Wang, J., Li, G., Zhang, Y., Deng, D., Feng, J.: A unified framework for string similarity search with edit-distance constraint. VLDB J. 26(2), 249–274 (2017)CrossRefGoogle Scholar
  59. 59.
    Zhuang, Y., Li, G., Feng, Z.Z.J.: Hike: a hybrid human-machine method for entity alignment in large-scale knowledge bases. In: CIKM (2017)Google Scholar
  60. 60.
    Zhuang, Y., Li, G., Zhong, Z., Feng, J.: PBA: partition and blocking based alignment for large knowledge bases. In: DASFAA, pp. 415–431 (2016)Google Scholar

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. 1.Department of Computer Science and TechnologyTsinghua UniversityBeijingChina
  2. 2.Qatar Computing Research InstituteHBKUDohaQatar

Personalised recommendations