Provenance-Aware Entity Resolution: Leveraging Provenance to Improve Quality

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9049)

Abstract

Entity resolution (ER), the process of identifying records that refer to the same real-world entity, arises in many application areas. Nevertheless, ER results are rarely completely accurate. In this paper, we investigate a provenance-aware framework for ER. We first propose an indexing structure that can be built efficiently to store provenance in support of an ER process. We then develop a generic repairing strategy, called coordinate-split-merge (CSM), to control the interaction between repairs driven by must-link and cannot-link constraints. Our experimental results show that the proposed indexing structure captures the provenance of ER efficiently in both time and space, and that it scales linearly with the number of matches. Our repairing algorithms can significantly reduce the human effort required to identify erroneous matches by leveraging the provenance of ER.
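
To illustrate how must-link and cannot-link constraints can interact with recorded matches, here is a minimal Python sketch. It is not the CSM algorithm or the indexing structure from the paper, neither of which is specified in this abstract. It assumes that provenance is simply the set of pairwise matches, that entities are the connected components of the match graph, and that a violated cannot-link constraint is repaired by dropping the first non-protected match on a shortest chain linking the two records. All function names (`explaining_path`, `repair`, `clusters`) are hypothetical.

```python
from collections import defaultdict, deque


def norm(pair):
    """Store each undirected match as a sorted tuple."""
    return tuple(sorted(pair))


def build_graph(matches):
    """Adjacency sets of the match graph."""
    graph = defaultdict(set)
    for a, b in matches:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def explaining_path(matches, a, b):
    """Shortest chain of matches linking a to b, or [] if they are not linked."""
    graph = build_graph(matches)
    parent = {a: None}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            break
        for nxt in sorted(graph[node]):  # sorted for deterministic results
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if b not in parent:
        return []
    path, node = [], b
    while parent[node] is not None:
        path.append(norm((parent[node], node)))
        node = parent[node]
    return path[::-1]


def repair(matches, must_link, cannot_link):
    """Add must-link pairs (merge), then separate each cannot-link pair (split)."""
    protected = {norm(p) for p in must_link}
    current = {norm(p) for p in matches} | protected  # merge step
    removed = []
    for a, b in cannot_link:
        while True:
            path = explaining_path(current, a, b)
            if not path:
                break
            # Assumption: must-link pairs are never dropped.
            candidates = [m for m in path if m not in protected]
            if not candidates:
                raise ValueError(f"constraints conflict on {a!r} and {b!r}")
            current.discard(candidates[0])  # split step
            removed.append(candidates[0])
    return current, removed


def clusters(records, matches):
    """Entities as connected components of the match graph."""
    graph = build_graph(matches)
    seen, result = set(), []
    for r in records:
        if r in seen:
            continue
        component, queue = {r}, deque([r])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        result.append(component)
    return result


records = ["r1", "r2", "r3", "r4"]
matches = [("r1", "r2"), ("r2", "r3")]
kept, removed = repair(matches, must_link=[("r3", "r4")], cannot_link=[("r1", "r3")])
entities = clusters(records, kept)
```

In this sketch the chain returned by `explaining_path` plays the role of provenance: it shows which matches caused two records to be grouped, so a reviewer would only need to inspect those matches rather than the whole cluster. Choosing which match on the chain to drop is exactly where a real system would need human input or a more careful strategy.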

Author information

Corresponding author

Correspondence to Qing Wang.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Wang, Q., Schewe, K.-D., Wang, W. (2015). Provenance-Aware Entity Resolution: Leveraging Provenance to Improve Quality. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9049. Springer, Cham. https://doi.org/10.1007/978-3-319-18120-2_28

  • DOI: https://doi.org/10.1007/978-3-319-18120-2_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18119-6

  • Online ISBN: 978-3-319-18120-2

  • eBook Packages: Computer Science, Computer Science (R0)
