Synonyms
Entity matching; Object deduplication; Record linkage; Reference reconciliation
Definition
Let \( \mathcal{E} \) denote a set of entities in a domain, described using a set of attributes \( \mathcal{A} \). Each entity \( E \in \mathcal{E} \) is associated with zero, one, or more values for each attribute \( A \in \mathcal{A} \). For each entity in \( \mathcal{E} \), there can be a set of records \( \mathcal{R} \), provided by one or more sources over the attributes \( \mathcal{A} \), where each record provides at most one value for an attribute. We consider atomic values (string, number, date, time, etc.) as attribute values, and allow multiple representations of the same value, as well as erroneous values, in records. Entity resolution takes as input the records provided by the sources and decides which records refer to the same entity; in particular, it computes a partitioning \( \mathcal{P} \) of \( \mathcal{R} \) such that records in each partition refer to the same entity, and records in different partitions refer to different entities.
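To make the setting concrete, the following Python sketch (with invented record identifiers, attribute names, and values) shows a small input record set and one valid partitioning of it:

```python
# Records over attributes {name, phone}; r1-r3 describe two real-world entities.
records = {
    "r1": {"name": "John A. Smith", "phone": "555-1234"},
    "r2": {"name": "J. Smith",      "phone": "555-1234"},  # same person, different representation
    "r3": {"name": "Mary Jones",    "phone": "555-9876"},
}

# A partitioning P of R: records in each partition refer to the same entity,
# records in different partitions refer to different entities.
partitioning = [{"r1", "r2"}, {"r3"}]

# Sanity checks: the partitions are pairwise disjoint and cover all records.
assert sum(len(p) for p in partitioning) == len(records)
assert set().union(*partitioning) == set(records)
```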
Historical Background
The problem of entity resolution, originally identified by Newcombe et al. [1], was first formalized by Fellegi and Sunter [2]. The problem has since been extensively studied, and a large number of approaches have been proposed. Notably, the literature on heterogeneous databases had focused on schema-centric approaches that assume a uniform representation of individual entities, until the seminal paper [3] addressed the entity-resolution problem as a core issue of data integration.
Early approaches for entity resolution focus on pairwise record matching, which consists of two components. The first component computes a vector of similarity scores for individual record pairs by comparing their attributes; a comprehensive survey can be found in [4]. The second component declares a candidate record pair a match or a non-match based on the similarity vector. A variety of methods have been proposed for this component, including rule-based methods [5], classification-based methods [6, 7], and distance-based methods [8].
Subsequent proposals focus on improving entity resolution by leveraging a global view of the data. First, clustering algorithms have been proposed to resolve inconsistent decisions for pairs of records (e.g., deciding a match between R1 and R2, a match between R2 and R3, and a non-match between R1 and R3); a comprehensive comparison of these methods can be found in [9]. Second, collective entity resolution was proposed for related entities of multiple types; entity relationships such as co-author and co-citation are considered when computing record similarity [10]. Third, constraints on the entity domain were considered as extra evidence for reconciliation [11]. Fourth, new approaches have been proposed that reconcile entities while being aware of possible value varieties, such as incorrect values [12] and evolving values [13].
To complement the previously described approaches, which focus on improving the effectiveness of entity resolution, strategies have been proposed to improve the efficiency and scalability of entity resolution, such as Canopy [14]. This thread has recently been revived for big data management, where algorithms have been proposed for load balancing in a MapReduce-based framework [15], for incremental entity resolution [16], and so on.
Scientific Fundamentals
Entity resolution consists of three steps: blocking, pairwise matching, and clustering. Among them, pairwise matching and clustering are used to ensure the semantics of entity resolution, while blocking is used to achieve scalability.
Pairwise Matching
The basic step of entity resolution is pairwise matching, which compares a pair of records and makes a local decision of whether or not they refer to the same entity. A variety of techniques have been proposed for this step.
Rule-based approaches [5, 17] have been commonly used for this step in practice, and apply domain knowledge to make the local decision. The advantage of this approach is that the rules can be tailored to effectively deal with complex matching scenarios. However, a significant disadvantage of this approach is that it requires considerable domain knowledge as well as knowledge about the data to formulate the pairwise matching rules, rendering it infeasible when the domain and the data exhibit considerable heterogeneity.
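A minimal sketch of a rule-based matcher is shown below; the attribute names (phone, email, birth_year) and the rules themselves are hypothetical, invented for illustration rather than drawn from [5] or [17]:

```python
def rule_match(r1, r2):
    """Hypothetical domain rules: conflicting birth years imply a non-match;
    an identical normalized phone number or email implies a match."""
    def norm_phone(p):
        # Normalize away formatting differences such as "555-1234" vs "(555) 1234".
        return "".join(ch for ch in p if ch.isdigit()) if p else None

    if r1.get("birth_year") and r2.get("birth_year") \
            and r1["birth_year"] != r2["birth_year"]:
        return False
    if norm_phone(r1.get("phone")) and \
            norm_phone(r1.get("phone")) == norm_phone(r2.get("phone")):
        return True
    if r1.get("email") and \
            r1.get("email", "").lower() == r2.get("email", "").lower():
        return True
    return False
```

Note how the normalization inside the rule encodes knowledge about the data (phone formatting) as well as the domain (birth years are discriminative), which is exactly the knowledge that makes rules hard to formulate for heterogeneous data.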
Classification-based approaches have also been used for this step since the seminal paper by Fellegi and Sunter [2], wherein a classifier is built using positive and negative training examples, and the classifier decides whether a pair of records is a match or a non-match; it is also possible for the classifier to output a “possible match”, in which case the local decision is turned over to a human. Such classification-based machine learning approaches have the advantage that they do not require significant domain knowledge, but only knowledge of whether a pair of records in the training data refers to the same entity or not. A disadvantage of this approach is that it often requires a large number of training examples to accurately train the classifier, though active learning based variations [7] are often effective at reducing the volume of training data that is needed.
Finally, distance-based approaches [8] compute record-level distance as the weighted sum of similarity of corresponding attribute values (e.g., using Levenstein distance for computing similarity of strings, and Euclidean distance for computing similarity of numeric attributes). Low and high thresholds are used to declare matches, non-matches and possible matches. A key advantage of this approach is that the domain knowledge is limited to formulating distance metrics on atomic attributes, which can be potentially reused for a large variety of entity domains. A disadvantage of this approach is that it is a blunt hammer, which often requires careful parameter tuning (e.g., what should the weights on individual attributes be for the weighted sum, what should the low and high thresholds be), though machine learning approaches can often be used to tune the parameters in a principled fashion.
Clustering
As mentioned previously, the local decisions of match or non-match made by the pairwise matchings step may not be globally consistent. The purpose of the clustering step is to reach a globally consistent decision of partitioning the set of all records such that each partition refers to a distinct entity.
This step first constructs a pairwise matching graph G, where each node corresponds to a distinct record Ri, and there is an undirected edge (Ri, Rj) if and only if the pairwise matching step declared a match between Ri and Rj. A clustering of G partitions the nodes into pairwise disjoint subsets based on the edges that are present in G. There exists a wealth of literature on clustering algorithms for entity resolution [9]. These clustering algorithms tend not to constrain the number of clusters in the output, since the number of entities in the data set is typically not known a priori.
One of the simplest graph clustering strategies is to efficiently cluster graph G into connected components by a single scan of the edges in the graph [5]. Essentially, this strategy places a high trust on the local “match” decision, so even a few erroneous match decisions can significantly alter the results of entity resolution.
At the other extreme, a robust but expensive graph clustering algorithm is correlation clustering [18]. The goal of correlation clustering is to find a partitioning of nodes in G that agrees as much as possible with the edges in G. To achieve this goal, one can either maximize agreements or minimize disagreements between the clustering and the edges. The two strategies are equivalent from the optimization point of view but differ from the approximation point of view. When minimizing disagreements, for each pair of nodes in the same cluster that are not connected by an edge, there is a cohesion penalty of 1; for each pair of nodes in different clusters with an edge, there is a correlation penalty of 1. Correlation clustering seeks to compute the clustering that minimizes the overall sum of the penalties. It has been proved that correlation clustering is NP-complete, and many efficient approximation algorithms have been proposed for this problem.
Blocking
Pairwise matching and clustering together ensure the desired semantics of entity resolution, but may be quite inefficient and even infeasible for a large set of records. The main source of inefficiency is that pairwise matching requires a quadratic number of record pair comparisons to decide which record pairs are matches and which are non-matches. When the number of records is even moderately large (e.g., 106), the number of pairwise comparisons becomes prohibitively large.
Blocking was proposed as a strategy to scale entity resolution to large data sets. The basic idea is to utilize a “blocking function” on the values of one or more attributes to partition the input records into multiple small blocks, and restrict the subsequent pairwise matching to records in the same block. The advantage of this strategy is that it can significantly reduce the number of required pairwise comparisons, and make entity resolution feasible and efficient even for large data sets. The disadvantage of this strategy is false negatives: if there are incorrect values or multiple representations in the value of any attribute used by the blocking function, records that ought to refer to the same entity may end up with different blocking key values, and hence could not be discovered to refer to the same entity by subsequent pairwise matching and clustering steps.
The key to addressing the disadvantage mentioned above is to allow multiple blocking functions. Hernández and Stolfo [5] were the first to make this observation, and showed that using multiple blocking functions could result in high quality entity resolution without necessarily incurring a high computational cost. In general, such blocking functions create a set of overlapping blocks that balance the recall of entity resolution (i.e., absence of false negatives) with the number of comparisons incurred by pairwise matching. For example, q-grams (A q-gram of a string value is a substring of length q. A q-gram of a set of values is a subset of size q.) blocking creates blocks of records that share at least one q-gram. Similarly, the Canopy method [14] employs a computationally cheap string similarity metric for building high-dimensional, overlapping blocks.
Key Applications
Data Cleaning
One key problem for data cleaning is to identify duplicates in the data caused by mistyping, abbreviations, synonyms, and so on. Entity resolution is an important step in this regard towards building a clean data set.
Data Integration and Data Warehousing
Data integration systems and data warehouses integrate data from a large number of heterogeneous data sources. In addition to schema variety, which has been the focus of the data integration field since it started, the sources also exhibit high variety in their way of describing and representing the same real-world entity. Entity resolution is critical to identify the same entity thereby enabling information from different sources to be aligned and merged.
Data Sets
The following URL provides a collection of real data sets and synthetic-data generators that are commonly used for experiments on entity resolution:
Url to Code
SecondString is an open-source Java-based package of approximate string-matching techniques:
Cross-References
Recommended Reading
Newcombe HB, Kennedy JM, Axford SJ, James AP. Automatic linkage of vital records. Science. 1959;130(3381):954–59.
Fellegi IP, Sunter AB. A theory for record linkage. J Am Stat Assoc. 1969;64(328):1183–210.
Cohen WW. Integration of heterogeneous databases without common domains using queries based on textual similarity. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2008. p. 201–12.
Cohen WW, Ravikumar P, Fienberg SE. A comparison of string distance metrics for name-matching tasks. In: Proceedings of the 3rd International Workshop on Information Integration on the Web; 2003. p. 73–8.
Hernandez MA, Stolfo SJ. The merge/purge problem for large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 1995. p. 127–38.
Winkler WE. Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. In: Proceedings of the Section on Survey Research Methods; 1988. p. 667–71.
Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2002. p. 269–78.
Dey D. Entity matching in heterogeneous databases: a logistic regression approach. Decis Support Syst. 2008;44(3):740–47.
Hassanzadeh O, Chiang F, Miller RJ, Lee HC. Framework for evaluating clustering algorithms in duplicate detection. Proc. VLDB Endowment. 2009;2(1):1282–293.
Dong X, Halevy AY, Madhavan J. Reference reconciliation in complex information spaces. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2005. p. 85–96.
Chaudhuri S, Sarma AD, Ganti V, Kaushik R. Leveraging aggregate constraints for deduplication. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2007. p. 437–48.
Guo S, Dong XL, Srivastava D, Zajac R. Record linkage with uniqueness constraints and erroneous values. Proc. VLDB Endowment. 2010;3(1):417–28.
Li P, Dong XL, Maurino A, Srivastava D. Linking temporal records. Proc. VLDB Endowment. 2011;4(11):956–67.
McCallum AK, Nigam K, Ungar LH. Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2000. p. 169–78.
Kolb L, Thor A, Rahm E. Load balancing for MapReduce-based entity resolution. In: Proceedings of the 28th International Conference on Data Engineering; 2012. p. 618–29.
Gruenheid A, Dong XL, Srivastava D. Incremental record linkage. Proc. VLDB Endowment. 2014;7(9):697–708.
Fan W, Jia X, Li J, Ma S. Reasoning about record matching rules. Proc. VLDB Endowment. 2009;2(1):407–18.
Bansal N, Blum A, Chawla S. Correlation clustering. In: Proceedings of the 19th International Conference on Machine Learning; 2002. p. 238–47.
Baxter R, Christen P, Churches T. A comparison of fast blocking methods for record linkage. In: Proceedings of the ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation; 2003. p. 253–68.
Kopcke H, Thor A, Rahm E. Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endowment. 2010;3(1):484–93.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this entry
Cite this entry
Dong, X.L., Srivastava, D. (2018). Entity Resolution. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_2547
Download citation
DOI: https://doi.org/10.1007/978-1-4614-8265-9_2547
Published:
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer ScienceReference Module Computer Science and Engineering