Abstract
Entity resolution (ER), the process of identifying records that refer to the same real-world entity, arises in many application areas. Nevertheless, resolving entities is hardly ever completely accurate. In this paper, we investigate a provenance-aware framework for ER. We first propose an indexing structure that can be built efficiently to store provenance in support of an ER process. We then develop a generic repairing strategy, called coordinate-split-merge (CSM), to control the interaction between repairs driven by must-link and cannot-link constraints. Our experimental results show that the proposed indexing structure captures the provenance of ER efficiently in both time and space, and that it scales linearly with the number of matches. Our repairing algorithms can significantly reduce the human effort required to identify erroneous matches by leveraging the provenance of ER.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Wang, Q., Schewe, K.-D., Wang, W. (2015). Provenance-Aware Entity Resolution: Leveraging Provenance to Improve Quality. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9049. Springer, Cham. https://doi.org/10.1007/978-3-319-18120-2_28
DOI: https://doi.org/10.1007/978-3-319-18120-2_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18119-6
Online ISBN: 978-3-319-18120-2
eBook Packages: Computer Science, Computer Science (R0)