Abstract
Entity resolution (ER), an important and common data cleaning problem, is about detecting duplicate data representations of the same external entities and merging them into single representations. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating three components of ER: (a) classifiers for duplicate/non-duplicate record pairs built using machine learning (ML) techniques; (b) MDs for supporting both the blocking phase of ML and the merge itself; and (c) the use of the declarative language LogiQL, an extended form of Datalog supported by the LogicBlox platform, for data processing and for the specification and enforcement of MDs.
Notes
- 1.
- 2.
- 3.
- 4.
http://academic.research.microsoft.com. For comparison, we also tested our system with data from DBLP and Cora.
- 5.
A more precise notation for the MD would be: \(\forall x_1^1 \cdots \forall y_2^m(\bigwedge _j R_1[x_1^j] \approx _j R_2[x_2^j] \ \longrightarrow \ \bigwedge _k R_1[y_1^k] \doteq R_2[y_2^k])\).
- 6.
These MDs are more general than those introduced in Sect. 2.1: they may contain regular database atoms, which are used to give context to the similarity atoms in the same antecedent.
- 7.
At this point, since all we want is to do blocking, and not yet to make decisions about duplicates, we could, in comparison with what is done with pairs in T, compute fewer similarity measures, and even use lower thresholds.
- 8.
Similarity computations are kept in appropriate program predicates. So similarity values computed before blocking can be reused at this stage, or whenever needed.
- 9.
The classifier also returns pairs of records that come from the same block but are not considered to be duplicates. This set is not of interest, at least as a workflow component.
- 10.
For our experiments, we independently used two other datasets: DBLP and Cora Citation.
- 11.
In LogiQL, each predicate has to be declared, unless it can be inferred from the rest of the program.
- 12.
- 13.
Actually, this natural condition makes the set of blocking-MDs interaction-free, i.e. for every two blocking-MDs \(m_1, m_2\), the set of attributes on the RHS of \(m_1\) and the set of attributes on the LHS of \(m_2\) on which there are similarity predicates, are disjoint [7].
- 14.
Notice that since we have interaction-free sets of blocking-MDs, stratified Datalog programs are expressive enough to express and enforce them [3]. LogiQL supports stratified Datalog.
- 15.
Whether a feature is considered in a weight vector computation depends on whether it has strong discrimination power, i.e. does not contain missing values.
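The interaction-freeness condition stated in note 13 can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the `BlockingMD` class, the attribute names (`Title`, `Name`, `BlockId`) and the function `is_interaction_free` are hypothetical, introduced only for illustration. It also assumes that "every two blocking-MDs" includes the case where both are the same MD.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BlockingMD:
    """Hypothetical, simplified view of a blocking-MD (not the paper's data model)."""
    name: str
    lhs_similarity_attrs: frozenset  # attributes under a similarity predicate on the LHS
    rhs_attrs: frozenset             # attributes identified on the RHS


def is_interaction_free(mds):
    """Return True if, for every ordered pair (m1, m2), the RHS attributes of m1
    are disjoint from the LHS similarity attributes of m2.

    Assumption: a pair may consist of the same MD twice.
    """
    return all(
        m1.rhs_attrs.isdisjoint(m2.lhs_similarity_attrs)
        for m1, m2 in product(mds, repeat=2)
    )


# Hypothetical MDs: m1 and m2 only write a block identifier and never compare on it.
md1 = BlockingMD("m1", frozenset({"Title"}), frozenset({"BlockId"}))
md2 = BlockingMD("m2", frozenset({"Name"}), frozenset({"BlockId"}))
# m3 compares on BlockId, which m1 and m2 modify, so it introduces an interaction.
md3 = BlockingMD("m3", frozenset({"BlockId"}), frozenset({"Name"}))

ok = is_interaction_free([md1, md2])
bad = is_interaction_free([md1, md2, md3])
print(ok, bad)
```

In this sketch the first set passes because no MD compares on an attribute that another MD changes; adding `md3` breaks the condition, since it compares on `BlockId`.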
References
Aref, M., ten Cate, B., Green, T.J., Kimelfeld, B., Olteanu, D., Pasalic, E., Veldhuizen, T.L., Washburn, G.: Design and Implementation of the LogicBlox System. In: Proceeding SIGMOD 2015, pp. 125–141 (2015)
Baxter, R., Christen, P., Churches, T.: A comparison of fast blocking methods for record linkage. In: Proceeding ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Identification, pp. 234–256 (2003)
Bahmani, Z., Bertossi, L., Kolahi, S., Lakshmanan, L.: Declarative entity resolution via matching dependencies and answer set programs. In: Proceeding KR 2012, pp. 380–390 (2012)
Baudat, G., Anouar, F.: Generalized discriminant analysis using a kernel approach. Neural Comput. 12(3), 2385–2404 (2000)
Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)
Bertossi, L., Kolahi, S., Lakshmanan, L.: Data cleaning and query answering with matching dependencies and matching functions. In: Proceeding ICDT 2011. ACM Press (2011)
Bertossi, L., Kolahi, S., Lakshmanan, L.: Data cleaning and query answering with matching dependencies and matching functions. Theory Comput. Syst. 52(3), 441–482 (2013)
Bleiholder, J., Naumann, F.: Data fusion. ACM Comput. Surv. 41(1), 1–41 (2008)
Ceri, S., Gottlob, G., Tanca, L.: Logic Programming and Databases. Springer, Heidelberg (1989)
Christen, P., Goiser, K.: Quality and complexity measures for data linkage and deduplication. In: Guillet, F., Hamilton, H. (eds.) Quality Measures in Data Mining. SCI, pp. 127–151. Springer, Heidelberg (2007)
Christen, P.: Automatic record linkage using seeded nearest neighbour and support vector machine classification. In: Proceeding SIGKDD 2008, pp. 151–159 (2008)
Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2011)
Cohen, W., Ravikumar, P., Fienberg, S.: A comparison of string metrics for matching names and records. In: Proceeding Workshop on Data Cleaning and Object Consolidation 2003, pp. 123–134 (2003)
Cover, T.M., Hart, P.E.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theor. 13(1), 21–27 (1967)
Elmagarmid, A., Ipeirotis, P., Verykios, V.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Fan, W.: Dependencies revisited for improving data quality. In: Proceeding PODS (2008)
Fan, W., Jia, X., Li, J., Ma, S.: Reasoning about Record Matching Rules. PVLDB 2(1), 407–418 (2009)
Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(1), 328–339 (1969)
Herzog, T.N., Scheuren, F.J., Winkler, W.E.: Data Quality and Record Linkage Techniques. Springer, New York (2007)
Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001)
Rastogi, V., Dalvi, N.N., Garofalakis, M.N.: Large-scale collective entity matching. PVLDB 4(4), 208–218 (2011)
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 513–523 (1988)
Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: Proceeding SIGMOD 2009, pp. 219–232 (2009)
Vapnik, V.N.: Statistical Learning Theory. Wiley (1998)
Winkler, W.E.: The state of record linkage and current research problems. Technical Report, U.S. Census Bureau (1999)
Acknowledgments
Part of this research was funded by an NSERC Discovery grant and the NSERC Strategic Network on Business Intelligence (BIN). Z. Bahmani and L. Bertossi are very grateful for the support from LogicBlox during their internship and sabbatical visit.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Bahmani, Z., Bertossi, L., Vasiloglou, N. (2015). ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution. In: Beierle, C., Dekhtyar, A. (eds) Scalable Uncertainty Management. SUM 2015. Lecture Notes in Computer Science, vol 9310. Springer, Cham. https://doi.org/10.1007/978-3-319-23540-0_27
DOI: https://doi.org/10.1007/978-3-319-23540-0_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23539-4
Online ISBN: 978-3-319-23540-0
eBook Packages: Computer Science, Computer Science (R0)