Synonyms

Entity matching; Object deduplication; Record linkage; Reference reconciliation

Definition

Let \( \mathcal{E} \) denote a set of entities in a domain, described using a set of attributes \( \mathcal{A} \). Each entity \( E \in \mathcal{E} \) is associated with zero, one, or more values for each attribute \( A \in \mathcal{A} \). The entities in \( \mathcal{E} \) are described by a set of records \( \mathcal{R} \), provided by one or more sources over the attributes \( \mathcal{A} \), where each record provides at most one value per attribute. Attribute values are atomic (string, number, date, time, etc.), and records may contain multiple representations of the same value as well as erroneous values. Entity resolution takes as input the records provided by the sources and decides which records refer to the same entity; in particular, it computes a partitioning \( \mathcal{P} \) of \( \mathcal{R} \) such that records in the same partition refer to the same entity and records in different partitions refer to different entities.
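As a minimal illustration of the input and output of entity resolution (the record identifiers, attribute names, and values below are hypothetical), the following Python sketch shows three records over the attributes name and phone, two of which refer to the same person, together with the corresponding partitioning:

```python
# Hypothetical records over the attributes "name" and "phone".
# R1 and R2 refer to the same entity (different representations of the name);
# R3 refers to a different entity.
records = {
    "R1": {"name": "John A. Smith", "phone": "555-1234"},
    "R2": {"name": "J. Smith",      "phone": "555-1234"},
    "R3": {"name": "Jane Doe",      "phone": "555-9876"},
}

# A partitioning P of R: each inner set corresponds to one distinct entity.
partitioning = [{"R1", "R2"}, {"R3"}]
```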

Historical Background

The problem of entity resolution, originally identified by Newcombe et al. [1], was first formalized by Fellegi and Sunter [2]. Since then, the problem has been studied extensively and a large number of approaches have been proposed. Notably, the literature on heterogeneous databases had focused on schema-centric approaches, assuming a uniform representation of individual entities, until the seminal paper [3], which addressed the entity-resolution problem as a core issue of data integration.

Early approaches to entity resolution focus on pairwise record matching, which consists of two components. The first component computes a vector of similarity scores for individual record pairs by comparing their attributes; a comprehensive survey can be found in [4]. The second component declares a candidate record pair as a match or non-match based on the similarity vector. A variety of methods have been proposed for this component, including rule-based methods [5], classification-based methods [6, 7], and distance-based methods [8].

Subsequent proposals focus on improving entity resolution by leveraging a global view of the data. First, clustering algorithms have been proposed to resolve inconsistent decisions for pairs of records (e.g., deciding a match between R1 and R2, a match between R2 and R3, but a non-match between R1 and R3); a comprehensive comparison of these methods can be found in [9]. Second, collective entity resolution was proposed for related entities of multiple types; entity relationships such as co-authorship and co-citation are considered when computing record similarity [10]. Third, constraints on the entity domain were considered as extra evidence for reconciliation [11]. Fourth, new approaches have been proposed that reconcile entities while accounting for possible value variations, such as incorrect values [12] and evolving values [13].

To complement the previously described approaches, which focus on improving the effectiveness of entity resolution, strategies have been proposed to improve its efficiency and scalability, such as Canopy [14]. This line of work has recently been revived for big data management, where algorithms have been proposed for load balancing in a MapReduce-based framework [15], for incremental entity resolution [16], and so on.

Scientific Fundamentals

Entity resolution consists of three steps: blocking, pairwise matching, and clustering. Among them, pairwise matching and clustering are used to ensure the semantics of entity resolution, while blocking is used to achieve scalability.

Pairwise Matching

The basic step of entity resolution is pairwise matching, which compares a pair of records and makes a local decision of whether or not they refer to the same entity. A variety of techniques have been proposed for this step.

Rule-based approaches [5, 17] have been commonly used for this step in practice, and apply domain knowledge to make the local decision. The advantage of this approach is that the rules can be tailored to effectively deal with complex matching scenarios. However, a significant disadvantage of this approach is that it requires considerable domain knowledge as well as knowledge about the data to formulate the pairwise matching rules, rendering it infeasible when the domain and the data exhibit considerable heterogeneity.
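As a minimal sketch of a rule-based matcher (the rule, attribute names, and records below are hypothetical, not rules from the cited systems), a hand-crafted rule might declare a match when two records share a phone number and agree on the last name:

```python
def last_name(name: str) -> str:
    """Return the last whitespace-separated token of a name (hypothetical normalization)."""
    return name.strip().split()[-1].lower()

def rule_based_match(r1: dict, r2: dict) -> bool:
    """Illustrative hand-crafted rule: same phone number and same last name."""
    same_phone = r1["phone"] == r2["phone"]
    same_last_name = last_name(r1["name"]) == last_name(r2["name"])
    return same_phone and same_last_name

# Example usage with hypothetical records:
r1 = {"name": "John A. Smith", "phone": "555-1234"}
r2 = {"name": "J. Smith", "phone": "555-1234"}
print(rule_based_match(r1, r2))  # True
```

Writing such a rule presupposes exactly the domain knowledge discussed above, e.g., that phone numbers are reliable discriminators in this particular data set.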

Classification-based approaches have also been used for this step since the seminal paper by Fellegi and Sunter [2]: a classifier is built using positive and negative training examples, and it decides whether a pair of records is a match or a non-match; the classifier may also output a “possible match”, in which case the local decision is turned over to a human. Such classification-based machine learning approaches have the advantage that they do not require significant domain knowledge, only knowledge of whether a pair of records in the training data refers to the same entity. A disadvantage of this approach is that it often requires a large number of training examples to train the classifier accurately, though active-learning-based variations [7] are often effective at reducing the volume of training data needed.
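As an illustrative sketch (assuming scikit-learn is available and that similarity vectors for record pairs have already been computed by the first component described above; the feature names and training data are hypothetical), a binary classifier can be trained on labeled pairs and then applied to new candidate pairs:

```python
from sklearn.linear_model import LogisticRegression

# Each training example is a similarity vector for one record pair,
# e.g. [name_similarity, phone_similarity], labeled match (1) or non-match (0).
X_train = [
    [0.95, 1.0],  # very similar names, identical phones  -> match
    [0.90, 1.0],
    [0.20, 0.0],  # dissimilar names, different phones    -> non-match
    [0.10, 0.0],
]
y_train = [1, 1, 0, 0]

clf = LogisticRegression().fit(X_train, y_train)

# Classify a new candidate pair by its similarity vector.
print(clf.predict([[0.85, 1.0]]))  # e.g. [1] -> match
```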

Finally, distance-based approaches [8] compute record-level distance as the weighted sum of the similarities of corresponding attribute values (e.g., using Levenshtein distance for strings and Euclidean distance for numeric attributes). Low and high thresholds are used to declare matches, non-matches, and possible matches. A key advantage of this approach is that the domain knowledge is limited to formulating distance metrics on atomic attributes, which can potentially be reused across a large variety of entity domains. A disadvantage is that it is a blunt instrument that often requires careful parameter tuning (e.g., what the weights on individual attributes should be in the weighted sum, and what the low and high thresholds should be), though machine learning approaches can often be used to tune the parameters in a principled fashion.
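A minimal sketch of the distance-based approach follows, assuming hypothetical attribute weights and thresholds; string similarity is computed with Python's difflib as a stand-in for a normalized edit-distance measure such as Levenshtein similarity:

```python
from difflib import SequenceMatcher

def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for a normalized edit-distance measure."""
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical attribute weights and decision thresholds.
WEIGHTS = {"name": 0.7, "phone": 0.3}
LOW, HIGH = 0.4, 0.8

def distance_based_decision(r1: dict, r2: dict) -> str:
    """Weighted sum of attribute similarities, compared against low/high thresholds."""
    score = sum(w * string_similarity(r1[attr], r2[attr])
                for attr, w in WEIGHTS.items())
    if score >= HIGH:
        return "match"
    if score <= LOW:
        return "non-match"
    return "possible match"

r1 = {"name": "John A. Smith", "phone": "555-1234"}
r2 = {"name": "J. Smith", "phone": "555-1234"}
print(distance_based_decision(r1, r2))
```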

Clustering

As mentioned previously, the local decisions of match or non-match made by the pairwise matching step may not be globally consistent. The purpose of the clustering step is to reach a globally consistent decision by partitioning the set of all records such that each partition refers to a distinct entity.

This step first constructs a pairwise matching graph G, where each node corresponds to a distinct record Ri, and there is an undirected edge (Ri, Rj) if and only if the pairwise matching step declared a match between Ri and Rj. A clustering of G partitions the nodes into pairwise disjoint subsets based on the edges that are present in G. There exists a wealth of literature on clustering algorithms for entity resolution [9]. These clustering algorithms tend not to constrain the number of clusters in the output, since the number of entities in the data set is typically not known a priori.

One of the simplest graph clustering strategies is to efficiently cluster graph G into connected components by a single scan of the edges in the graph [5]. Essentially, this strategy places a high trust on the local “match” decision, so even a few erroneous match decisions can significantly alter the results of entity resolution.
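A minimal sketch of this connected-components strategy is shown below, using a union-find structure over the pairwise match decisions (i.e., the edges of G); each resulting component becomes one cluster. The match pairs in the example are hypothetical.

```python
def connected_components(record_ids, match_pairs):
    """Cluster records into the connected components of the match graph (union-find)."""
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in match_pairs:            # single scan of the edges
        parent[find(a)] = find(b)

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())

# Hypothetical match decisions from the pairwise matching step.
print(connected_components(["R1", "R2", "R3", "R4"],
                           [("R1", "R2"), ("R2", "R3")]))
# e.g. [{'R1', 'R2', 'R3'}, {'R4'}]
```

Note how a single spurious edge, say ("R3", "R4"), would merge all four records into one cluster, which is exactly the sensitivity to erroneous match decisions described above.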

At the other extreme, a robust but expensive graph clustering algorithm is correlation clustering [18]. The goal of correlation clustering is to find a partitioning of nodes in G that agrees as much as possible with the edges in G. To achieve this goal, one can either maximize agreements or minimize disagreements between the clustering and the edges. The two strategies are equivalent from the optimization point of view but differ from the approximation point of view. When minimizing disagreements, for each pair of nodes in the same cluster that are not connected by an edge, there is a cohesion penalty of 1; for each pair of nodes in different clusters with an edge, there is a correlation penalty of 1. Correlation clustering seeks to compute the clustering that minimizes the overall sum of the penalties. It has been proved that correlation clustering is NP-complete, and many efficient approximation algorithms have been proposed for this problem.
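As a hedged illustration, the sketch below implements a randomized pivot-style heuristic in the spirit of the approximation algorithms mentioned above (not the specific algorithm of [18]): repeatedly pick a pivot node, form a cluster from the pivot and its unclustered neighbors, and remove them from the graph.

```python
import random

def pivot_correlation_clustering(nodes, edges, seed=0):
    """Randomized pivot heuristic for correlation clustering.

    'edges' are the pairs declared as matches ('+'); absent pairs are treated as '-'."""
    random.seed(seed)
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    unclustered = set(nodes)
    clusters = []
    while unclustered:
        pivot = random.choice(sorted(unclustered))
        cluster = {pivot} | (neighbors[pivot] & unclustered)
        clusters.append(cluster)
        unclustered -= cluster
    return clusters

# Hypothetical match graph with one inconsistent triangle: R1-R2 and R2-R3 match,
# but R1-R3 does not; any clustering must pay at least one penalty.
print(pivot_correlation_clustering(["R1", "R2", "R3"], [("R1", "R2"), ("R2", "R3")]))
```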

Blocking

Pairwise matching and clustering together ensure the desired semantics of entity resolution, but may be quite inefficient, or even infeasible, for a large set of records. The main source of inefficiency is that pairwise matching requires a quadratic number of record-pair comparisons to decide which record pairs are matches and which are non-matches. When the number of records is even moderately large (e.g., \(10^6\)), the number of pairwise comparisons becomes prohibitively large.

Blocking was proposed as a strategy to scale entity resolution to large data sets. The basic idea is to use a “blocking function” on the values of one or more attributes to partition the input records into multiple small blocks, and to restrict the subsequent pairwise matching to records in the same block. The advantage of this strategy is that it can significantly reduce the number of required pairwise comparisons, making entity resolution feasible and efficient even for large data sets. The disadvantage is false negatives: if there are incorrect values or multiple representations in the value of any attribute used by the blocking function, records that ought to refer to the same entity may end up with different blocking-key values, and hence cannot be discovered to refer to the same entity by the subsequent pairwise matching and clustering steps.
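A minimal sketch of blocking with a single blocking function (here, hypothetically, the value of a zip-code attribute): records are grouped by their blocking key, and only pairs within the same block are passed to pairwise matching.

```python
from itertools import combinations

def block_records(records, blocking_key):
    """Partition records into blocks by the value of a blocking function."""
    blocks = {}
    for rid, rec in records.items():
        blocks.setdefault(blocking_key(rec), []).append(rid)
    return blocks

def candidate_pairs(blocks):
    """Only pairs within the same block are compared by pairwise matching."""
    for block in blocks.values():
        yield from combinations(sorted(block), 2)

# Hypothetical records blocked on the "zip" attribute.
records = {
    "R1": {"name": "John A. Smith", "zip": "10001"},
    "R2": {"name": "J. Smith",      "zip": "10001"},
    "R3": {"name": "Jane Doe",      "zip": "94105"},
}
print(list(candidate_pairs(block_records(records, lambda r: r["zip"]))))
# [('R1', 'R2')]
```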

The key to addressing the disadvantage mentioned above is to allow multiple blocking functions. Hernández and Stolfo [5] were the first to make this observation, showing that using multiple blocking functions could yield high-quality entity resolution without necessarily incurring a high computational cost. In general, such blocking functions create a set of overlapping blocks that balance the recall of entity resolution (i.e., the absence of false negatives) against the number of comparisons incurred by pairwise matching. For example, q-gram blocking (a q-gram of a string value is a substring of length q; a q-gram of a set of values is a subset of size q) creates blocks of records that share at least one q-gram; a sketch is given below. Similarly, the Canopy method [14] employs a computationally cheap string-similarity metric to build overlapping blocks (canopies), targeting efficient clustering of high-dimensional data sets.
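As an illustrative sketch of q-gram blocking (with q = 3 over a hypothetical name attribute), each record is placed into one block per q-gram of its blocking key, producing overlapping blocks; two records become a candidate pair if they share at least one q-gram.

```python
def qgrams(value: str, q: int = 3):
    """All substrings of length q of a (lower-cased) string value."""
    v = value.lower()
    return {v[i:i + q] for i in range(len(v) - q + 1)}

def qgram_blocks(records, attr, q=3):
    """Overlapping blocks: one block per q-gram; a record joins every block of its q-grams."""
    blocks = {}
    for rid, rec in records.items():
        for g in qgrams(rec[attr], q):
            blocks.setdefault(g, set()).add(rid)
    return blocks

records = {
    "R1": {"name": "John Smith"},
    "R2": {"name": "Jon Smith"},
    "R3": {"name": "Jane Doe"},
}
blocks = qgram_blocks(records, "name")
# R1 and R2 share q-grams such as "smi", so they land in a common block
# despite the typo in the first name.
print(any({"R1", "R2"} <= b for b in blocks.values()))  # True
```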

Key Applications

Data Cleaning

One key problem in data cleaning is identifying duplicates in the data caused by mistyping, abbreviations, synonyms, and so on. Entity resolution is an important step towards building a clean data set.

Data Integration and Data Warehousing

Data integration systems and data warehouses integrate data from a large number of heterogeneous data sources. In addition to schema variety, which has been the focus of the data integration field since its inception, the sources also exhibit high variety in the way they describe and represent the same real-world entity. Entity resolution is critical for identifying records that refer to the same entity, thereby enabling information from different sources to be aligned and merged.

Experimental Results

In general, for every presented method, there is an accompanying experimental evaluation in the corresponding reference. Comprehensive experimental studies in this field include [19] for blocking, [20] for entity matching, and [9] for clustering.

Data Sets

The following URL provides a collection of real data sets and synthetic-data generators that are commonly used for experiments on entity resolution:

http://www.cs.utexas.edu/users/ml/riddle/data.html.

URL to Code

SecondString is an open-source Java-based package of approximate string-matching techniques:

http://secondstring.sourceforge.net/.

Cross-References