
1 Introduction

RDF is now ubiquitous on the Web. For example, more than 1 billion URLs embed RDF data in different serialization formats. A representative fragment of the RDF data available on the planet, the Linked Open Data Cloud, has grown from merely 12 to almost 10,000 knowledge bases [9] over the last decade. The growth in the number of knowledge bases is accompanied by a growth in size. For example, DBpedia, one of the most popular knowledge bases, has grown from \(\approx 10^8\) triples (DBpedia 2.0) in 2007 to \(\approx 10^{10}\) triples in 2016 (DBpedia 2016-04). This growth in the number and size of knowledge bases has led to an increased need for efficient and accurate link discovery approaches, i.e., approaches that compute links between knowledge bases.

In this paper, we consider the declarative link discovery setting [15], in which the set of conditions under which two resources are to be linked is to be devised explicitly and subsequently executed to compute links across two (not necessarily distinct) sets of RDF resources S and T. The declarative link discovery problem has been shown to be challenging even for domain experts, since this set of conditions, which we call a link specification, can be very complex [11]. This complexity is due to

  1.

    the plethora of similarity measures (e.g., edit distance, cosine similarity, Jaccard similarity) that are used to compare the property values of entities to find links and

  2.

    the manifold means through which these similarities can be combined (e.g., min, max, linear combinations).

A large number of dedicated machine learning algorithms, ranging from genetic programming [11] to refinement operators [25], have hence been devised to simplify the declarative link discovery process (see [15] for a survey). While the F-measure of these machine learning approaches has increased steadily, little attention has been paid to their time efficiency. Most commonly, the approaches are declared scalable by virtue of the bounded similarity computation algorithms they rely upon, e.g., AllPairs [1], PPJoin+ [30], EdJoin [29]. Initial pruning approaches have been developed in works such as [25], but solely for particularly slow versions of the respective algorithms. Given that the time efficiency of learning approaches for link discovery is critical for their usefulness in practical applications (e.g., in learning scenarios with humans in the loop), the primary aim of this work is hence to improve the time efficiency of link discovery algorithms while maintaining their classification performance (i.e., maintaining the same F-measure). We achieve this goal with our novel approach Dragon, a decision-tree-based approach for link discovery. The contributions of this work are as follows:

  1.

    We devise an efficient and effective algorithm for learning link specifications within the decision tree paradigm.

  2.

    We evaluate our algorithm on nine benchmark data sets against state-of-the-art link discovery and decision-tree-learning approaches. Our results show that Dragon outperforms existing link discovery solutions significantly w.r.t. runtime, while achieving equally good performance w.r.t. the F-measure. Moreover, our approach outperforms generic solutions to learning decision trees significantly.

  3.

    We investigate why our approach produces better link specifications than common decision-tree algorithms.

An open-source implementation of Dragon is provided in the LIMES [18] framework (Footnote 1).

This paper is structured as follows: We start by giving a formal definition of the key concepts underlying this work. Thereafter, we give a brief overview of the state of the art. We subsequently present our approach to link discovery, which we finally evaluate on synthetic and real-world datasets.

2 Preliminaries

Definition 1

(Link Discovery). Given two sets S (source) and T (target) of RDF resources and a relation R, compute the set M of pairs of instances \((s,t) \in S \times T\) such that \(R(s,t)\) holds.

We call M a mapping. Commonly, the set S is a subset of the instances contained in the knowledge base \(\mathcal {K}_S\). The same applies to the set T and a knowledge base \(\mathcal {K}_T\). Note that neither S and T nor \(\mathcal {K}_S\) and \(\mathcal {K}_T\) are necessarily disjoint. Since computing M is a non-trivial task, most frameworks compute an approximation \(M' = \{(s,t) \in S \times T : \sigma (s,t) \ge \theta \}\), where \(\sigma \) is a similarity function and \(\theta \) is a similarity threshold. The relation \(R(s,t)\) is then considered to hold if \(\sigma (s,t) \ge \theta \). The similarity function \(\sigma \) and the threshold \(\theta \) are expressed in a link specification (LS). Different grammars have been proposed to describe LSs [12, 14, 19]. We adopt the following formal setting, which is akin to that of [25]. This grammar is equivalent to that used in a large body of work [15].

We begin by defining the syntax of link specifications. To this end, we define a similarity measure m to be a function \(m : S \times T \rightarrow [0,1]\). These functions commonly compare attributes (or sets of attributes) of pairs of resources to compute a similarity value. Mappings \(M \subseteq S \times T\) are used to store the results of the application of a similarity function to \(S \times T\). We define a filter as a function \(f(m,\theta )\) over the set of all mappings (i.e., the powerset of \(S \times T\)). A link specification is called atomic iff it consists of exactly one filter function. A complex specification L can be obtained by combining two link specifications \(L_1\) and \(L_2\) with the operators \(\sqcap , \sqcup \) and \(\backslash \).

We define the semantics \([[L]]_M\) of an LS L w.r.t. a mapping M as follows:

  • \([[f(m, \theta )]]_M = \{(s,t) | (s,t) \in M \wedge m(s, t) \ge \theta \}\)

  • \([[L_1 \sqcap L_2]]_M = \{ (s,t) | (s,t) \in [[L_1]]_M \wedge (s,t) \in [[L_2]]_M \}\)

  • \([[L_1 \sqcup L_2]]_M = \{ (s,t) | (s,t) \in [[L_1]]_M \vee (s,t) \in [[L_2]]_M \}\)

  • \([[L_1 \backslash L_2]]_M = \{ (s,t) | (s,t) \in [[L_1]]_M \wedge (s,t) \notin [[L_2]]_M \}\)

Moreover, we write [[L]] as a shorthand for \([[L]]_{S \times T}\).
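To make these semantics concrete, the following is a minimal Python sketch (an illustration, not part of the LIMES implementation) that evaluates an atomic filter and the three operators over mappings represented as sets of pairs; the measure function m and the threshold are assumptions for illustration:

```python
# A mapping is a set of (source, target) pairs; a measure maps a pair to [0, 1].

def atomic(m, theta, M):
    """[[f(m, theta)]]_M: keep the pairs whose similarity reaches the threshold."""
    return {(s, t) for (s, t) in M if m(s, t) >= theta}

def conj(L1_M, L2_M):
    """[[L1 ⊓ L2]]_M: pairs accepted by both specifications."""
    return L1_M & L2_M

def disj(L1_M, L2_M):
    """[[L1 ⊔ L2]]_M: pairs accepted by at least one specification."""
    return L1_M | L2_M

def minus(L1_M, L2_M):
    """[[L1 \\ L2]]_M: pairs accepted by L1 but not by L2."""
    return L1_M - L2_M
```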

Definition 2

(Link Discovery as Classification). The goal of link discovery can be translated to finding a classifier \(\mathcal {C}: S \times T \rightarrow \{-1, +1\}\) which maps non-matches (i.e. \((s, t) \in S \times T: \lnot R(s,t)\)) to the class \(-1\) and matches to \(+1\).

The classifier returns \(+1\) for a pair \((s,t)\) iff \(\sigma (s, t) \ge \theta \) for the corresponding link specification. In all other cases, the classifier returns \(-1\). In this work, we use decision trees for link discovery. The attributes we use are similarity measures. Because these measures have numeric values, we compute decision trees with binary splits. An example tree is shown in Fig. 1.

Fig. 1. A decision tree learned by Dragon for the dataset of which a part is shown in Table 1. The classification decisions are shown in the gray nodes.

Table 1. Example datasets containing persons, together with link candidates, their labels, and the corresponding similarity values obtained using cosine similarity on the name attributes. We represent the data as tables and omit namespaces for the sake of legibility.

3 The Dragon Algorithm

In the following, we present how we use decision trees to learn accurate LSs efficiently. We begin by giving a brief overview of our approach. Thereafter, we show how we tailor the construction of decision trees to the LS learning problem. Finally, we show how to prune trees to avoid overfitting.

3.1 Overview

Algorithms 1 and 2 give an overview of our approach, depending on the measure we use. We assume that we are given a maximum height for the tree. To determine the root of the tree, we detect the best split attribute. We considered two measures for this purpose: the global F-measure aims to maximize the total F-measure achieved by the decision tree, while the Gini measure acts locally and aims to ensure that the split attribute is locally optimal. We elaborate on how we use these fitness functions to find good split attributes in the section “Determining Split Attributes”. Dragon continues recursively by constructing the left and right child of the tree if it has found a split attribute and has not yet reached the maximum height. For Gini, it updates the training data depending on the split attribute by removing any link pairs that are not accepted by the split attribute. For example, for the decision tree shown in Fig. 1, the training data for the right child of the root node would not include the pair (S5, T5) from Table 1(c), because the similarity value of the qgrams measure on the name attribute is lower than 0.4. The training data for the left child (ergo, the “no” child, see Fig. 1) is the global training data used to initialize our approach minus the training data for the right child of the root node. For example, the pair (S4, T4) would not be contained in the training data for the left child, since the similarity value of the qgrams measure on the name attribute is 1.0, and it therefore belongs to the training data of the right child. For the global F-measure, the training data does not need to be changed, since we optimize globally.
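The following minimal Python sketch illustrates this recursion; it is an illustration rather than the actual LIMES implementation, and the node representation and the helper find_best_split are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    split: object                   # an atomic LS, e.g. f(qgrams(name, name), 0.4)
    left: Optional["Node"] = None   # "no" child: pairs rejected by the split
    right: Optional["Node"] = None  # "yes" child: pairs accepted by the split

def build_tree(pairs, measures, height, max_height, find_best_split):
    """Recursively grow a tree of similarity splits (Gini-style data partitioning)."""
    if height >= max_height:
        return None
    split = find_best_split(pairs, measures)
    if split is None:
        return None
    accepted = {p for p in pairs if split.accepts(p)}  # training data, right child
    rejected = pairs - accepted                        # training data, left child
    return Node(split,
                left=build_tree(rejected, measures, height + 1, max_height,
                                find_best_split),
                right=build_tree(accepted, measures, height + 1, max_height,
                                 find_best_split))
```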

In the following, we take a closer look at how the split attributes are computed.

[Algorithm 1 and Algorithm 2: pseudocode listings of Dragon with the two fitness functions; not reproduced here]

3.2 Initialization

The goal of the initialization is to determine the set of mappings that will be used during the construction of the decision tree. We assume that we are given a training data set TS with \(TS \subseteq S \times T \times \{+1, -1\}\) (see Table 1 for an example). Every triple \((s, t, +1) \in TS\) is a positive example, while every triple \((s, t, -1) \in TS\) is a negative example. We begin by calculating the subsets \(S' \subseteq S\) and \(T' \subseteq T\) that can be found in our training dataset as follows:

$$\begin{aligned} S' = \{s \in S: \exists t \in T \text{ with } (s, t, +1) \in TS \vee (s, t, -1) \in TS \}, \end{aligned}$$
$$\begin{aligned} T' = \{t \in T: \exists s \in S \text{ with } (s, t, +1) \in TS \vee (s, t, -1) \in TS \}. \end{aligned}$$

We use the sets \(S'\) and \(T'\) as test sets for our algorithm.

If we use the global F-measure as fitness function, we adopt the approach of [25] and compute, for each of the similarity measures \(m_i\) available to our approach, the mapping \(M_{i} = \{(s,t) \in S' \times T' : m_{i}(s,t) \ge \theta _i\}\) which achieves the highest F-measure on the training dataset TS. We determine the threshold \(\theta _i\) by lowering it from 1 to a lower bound \(\lambda \) at a given rate \(\tau \in [0,1)\).
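A minimal sketch of this threshold search follows (illustrative Python; interpreting the rate \(\tau\) as the step by which the threshold is lowered, and assuming a helper f_measure that scores a mapping against the labeled examples):

```python
def best_atomic_mapping(m, pairs, lam, tau, f_measure):
    """Find the threshold in [lam, 1] whose mapping maximizes the F-measure."""
    best_theta, best_f, best_mapping = 1.0, -1.0, set()
    theta = 1.0
    while theta >= lam:            # assumes tau > 0, so the loop terminates
        mapping = {(s, t) for (s, t) in pairs if m(s, t) >= theta}
        f = f_measure(mapping)
        if f > best_f:
            best_theta, best_f, best_mapping = theta, f, mapping
        theta -= tau               # lower the threshold at the given rate
    return best_theta, best_mapping
```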

For the Gini Index, we begin by calculating the similarity of each entity pair \((s, t) \in S' \times T'\) using all the measures \(m_{i}\) available to our learner. Any similarity values below the lower bound \(\lambda \) are disregarded, i.e., set to 0. For each measure \(m_{i}\), we now have a list of similarity values \([sim_{i1},\ldots,sim_{in}]\), ordered from lowest to highest, together with the corresponding entity pairs \((s,t) \in S' \times T'\).

3.3 Determining Split Attributes

We build our decision tree using top-down induction [24]. We implement two measures to decide on the attribute to use for the splits: Global F-measure and Gini Index. With the global F-measure, we target the improvement of the overall performance of the decision tree we learn. In contrast, the Gini Index aims to improve the local performance of a given leaf. The performance of the two strategies is compared in the evaluation section.

Learning with the Global F-Measure. The trees we learn with the global F-measure are a combination of the atomic LSs computed during the initialization step. Let k be the number of these atomic LSs. We set the atomic link specification which leads to the mapping with the highest F-measure over all measures \(m_i\) as our first split attribute. After this initialization of the tree, our tree consists of a root and 2 leaves. In the example shown in Fig. 1, the root would be the node \(jaccard(name, name) \ge 0.4\). We first remove the root from the set of LSs that can be added to the tree, leaving \(k-1\) LSs in the set of candidate LSs. We sequentially position each of the remaining \(k-1\) LSs at each of the 2 leaf nodes, hence generating \(2(k-1)\) trees. We then compute the resulting mapping and select the tree with the best F-measure. After removing the LS used from the set of candidate LSs, we iterate this addition until we cannot improve the F-measure achieved by the tree or until we have used all atomic LSs computed during the initialization step.
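A sketch of this greedy procedure (illustrative Python; make_tree, extended_at, leaves, and f_measure_of are assumed helpers that build a tree, place an LS at a leaf, enumerate leaves, and score the mapping of an LS or tree, respectively):

```python
def learn_global_f(atomic_lss, f_measure_of, make_tree):
    """Greedy tree construction with the global F-measure (cubic in k, see Eq. 1)."""
    candidates = list(atomic_lss)
    root = max(candidates, key=f_measure_of)       # best atomic LS becomes the root
    candidates.remove(root)
    tree = make_tree(root)
    best_f = f_measure_of(tree)
    while candidates:
        # Try every remaining candidate LS at every leaf of the current tree.
        variants = [(tree.extended_at(leaf, ls), ls)
                    for ls in candidates for leaf in tree.leaves()]
        best_variant, best_ls = max(variants, key=lambda v: f_measure_of(v[0]))
        if f_measure_of(best_variant) <= best_f:   # no improvement: stop
            break
        tree, best_f = best_variant, f_measure_of(best_variant)
        candidates.remove(best_ls)
    return tree
```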

Formally, at iteration \(i \in \{1, \ldots , k\}\), we have \((k-i+1)\) candidate LSs to try out and i nodes in the decision tree where an atomic LS can be added. Hence, the maximal number of trees we need to generate is given by the following:

$$\begin{aligned} \sum \limits _{i=1}^k i(k-i+1) = \frac{k(k+1)(k+2)}{6} \in O(k^3). \end{aligned}$$
(1)

The number of trees our algorithm computes is hence clearly polynomial in k. In contrast, generating all possible decision trees that can be created using k atomic link specifications is exponential in k.

Learning with the Gini Index. We use the similarity values we determined in the initialization in combination with the Gini Index as follows: We determine which measure will be our splitting attribute by calculating the average Gini Index of all measures still available as follows:

$$\begin{aligned} \overline{G}(\mathcal {N}, sim_{ij})=\frac{|\mathcal {N}_{l}|}{|\mathcal {N}|} \times G(\mathcal {N}_{l}) + \frac{|\mathcal {N}_{r}|}{|\mathcal {N}|} \times G(\mathcal {N}_{r}), \end{aligned}$$
(2)

where \(|\mathcal {N}|\) is the number of pairs accepted by the node \(\mathcal {N}\) of the decision tree. In the first iteration, \(\mathcal {N}\) contains the training data. \(\mathcal {N}_{l}\) accepts the pairs with similarity values below \(sim_{ij}\), while \(\mathcal {N}_{r}\) accepts the pairs with values above or equal to \(sim_{ij}\). For each \(m_{i}\), we calculate the average Gini Index for all \(sim_{ij}\), with \(j \in \{1,\cdots ,n\}\), to determine the best split point for each measure.

The Gini Index is

$$\begin{aligned} G(\mathcal {N})= 1 - \left( \left( \frac{|\mathcal {N}^{+}|}{|\mathcal {N}|}\right) ^{2} + \left( \frac{|\mathcal {N}^{-}|}{|\mathcal {N}|}\right) ^{2} \right) , \end{aligned}$$
(3)

where \(|\mathcal {N}^{+}|\) is the number of positive examples and \(|\mathcal {N}^{-}|\) the number of negative examples accepted by \(\mathcal {N}\). The splitting attribute is then the measure \(m_i\) with the lowest average Gini Index, with the corresponding threshold \(\theta = sim_{ij}\). Common decision tree algorithms set the threshold in the splitting attribute to \(\theta = (sim_{ij} + sim_{i(j-1)} )/ 2\), whereas we set it to the higher value (i.e., \(sim_{ij}\)). In the evaluation section, we will see that this choice improves the quality of our decision tree considerably.
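The following sketch (illustrative Python with simplified data handling, not the paper's exact implementation) shows how a split point could be chosen according to Eqs. (2) and (3); note that the returned threshold is the candidate similarity value itself rather than a midpoint:

```python
def gini(pos, neg):
    """Gini Index (Eq. 3) of a node with pos positive and neg negative examples."""
    n = pos + neg
    if n == 0:
        return 0.0
    return 1.0 - ((pos / n) ** 2 + (neg / n) ** 2)

def best_split(sims):
    """Choose the split point with the lowest average Gini Index (Eq. 2).

    sims: ascending list of (similarity, label) pairs, label in {+1, -1}.
    Returns (theta, avg_gini), where theta is the candidate value itself
    (the 'upper value'), not the midpoint used by classical learners.
    """
    n = len(sims)
    total_pos = sum(1 for _, y in sims if y == +1)
    best_theta, best_gini = None, float("inf")
    for candidate, _ in sims:
        n_l = sum(1 for s, _ in sims if s < candidate)          # pairs below split
        n_l_pos = sum(1 for s, y in sims if s < candidate and y == +1)
        n_r, n_r_pos = n - n_l, total_pos - n_l_pos             # pairs >= split
        avg = (n_l * gini(n_l_pos, n_l - n_l_pos)
               + n_r * gini(n_r_pos, n_r - n_r_pos)) / n
        if avg < best_gini:
            best_theta, best_gini = candidate, avg
    return best_theta, best_gini
```

On the worked example at the end of this section, such a procedure would return \(\theta = 0.707\) with an average Gini Index of 0.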

After finding the root node, the training data for the children is provided as shown in Algorithm 1. We stop when we either have reached the maximum tree height or no measure with an average Gini Index below 1 can be found.

For example, imagine we were to use the example in Table 1 to learn a tree using the Gini Index. \(S' \times T'\) consists of the link pairs from the training data; the first step is therefore to determine the similarity values of these pairs using all available measures on all attributes. We provide the cosine similarity on the name attributes in Table 1(c). To test how well the cosine similarity performs, we have to find the ideal split point. We start with the lowest similarity value greater than 0, which in our case is 0.707. \(|\mathcal {N}_{l}| = 2\), since only two links have a similarity value below 0.707; \(|\mathcal {N}_{r}| = 3\), containing the remaining links. To calculate the average Gini Index, we need to determine the Gini Indices of the left and right node. Both are 0, since the left node contains only negative and the right node only positive examples. Since our split attribute perfectly divides the examples into the appropriate classes, we have pure leaves with an average Gini Index of 0, and our tree induction is finished. Bear in mind that we choose \(\theta = 0.707\) in our link specification, in contrast to common decision tree algorithms, which would take \(\theta = (0.707 + 0.0)/2 = 0.3535\).

3.4 Pruning

Previous works (e.g., [24]) have shown that pruning decision trees can improve their performance significantly. Given our approach to decision tree generation, we devised a Global F-measure pruning and, for the sake of comparison, implemented the error estimate pruning used by [24]. Given a tree \(\mathcal {N}\) with height h, we start our pruning process by iterating over the nodes \(\mathcal {N}_{i}\) at height \(h-2\), with \(i \in \{1,\dots ,n\}\), where n is the number of nodes at height \(h-2\). We compute the F-measure of \(\mathcal {N}\) after pruning the left, the right, and both leaves of \(\mathcal {N}_{i}\), respectively. The tree with the best F-measure is kept, i.e., \(\mathcal {N}\) is overwritten by it. After repeating this process for all nodes at height \(h-2\), we continue at height \(h-3\) and so on, until we reach the root node and terminate.
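A high-level sketch of this Global F-measure pruning (illustrative only; height, nodes_at_height, remove_children, and f_measure_of are assumed helpers, not the paper's actual API):

```python
import copy

def prune_global_f(tree, f_measure_of):
    """Bottom-up pruning: at each node, keep the variant with the best F-measure."""
    for h in range(tree.height - 2, -1, -1):             # heights h-2 down to the root
        for node_id in tree.nodes_at_height(h):
            best = tree
            for which in ("left", "right", "both"):
                variant = copy.deepcopy(tree)
                variant.remove_children(node_id, which)  # prune the given leaves
                if f_measure_of(variant) > f_measure_of(best):
                    best = variant
            tree = best                                  # the tree is overwritten
    return tree
```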

If we were to prune the tree from Fig. 1, we would first compute the F-measure of the whole tree and then the F-measure after removing the node \(jaccard (age, age) \ge 0.8\).

Table 2. Characteristics of the datasets used.

4 Evaluation

The aim of our evaluation was to assess the effectiveness (i.e., the F-measure) and the efficiency (i.e., the runtime) achieved by Dragon. We were especially interested in the performance of Dragon on real datasets. Hence, we evaluated Dragon on the four real-world datasets from [14]. These datasets were obtained by manually curating data harvested from the Web and determining links between these datasets manually. In addition, we aimed to compute the performance of Dragon on synthetic datasets. We selected three datasets from the OAEI 2010 benchmark (Footnote 2) and the two datasets Drugs and Movies used in [19]. Table 2 presents some details pertaining to these datasets. We ran two series of experiments. In the first series of experiments, we tested whether our approach to setting thresholds in decision trees is superior to that followed by other decision-tree learning approaches. To this end, we compared different ways of setting thresholds in our algorithm. We also used this experiment to determine the default settings for subsequent experiments. In our second series of experiments, we compared Dragon with state-of-the-art link discovery algorithms. In all experiments, we used a ten-fold cross-validation setting where the training folds consist of 50% positive and 50% negative examples. We present the average results achieved by all algorithms over the ten runs of the cross-validation setting. To ease the comparison between algorithms over all datasets, we also calculated the average rank of the approaches for each performance measure (runtime or F-measure). To this end, the approaches are ranked by their performance per dataset, ties are assigned the mean of the corresponding ranks, and the ranks are averaged column-wise to obtain the final value. This average rank is added as the last row of the tables we present. All experiments were carried out on a 64-core 2.3 GHz server running Oracle Java 1.8.0_77 on Ubuntu 14.04.4 LTS, with each experiment assigned 20 GB of RAM.
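For illustration, the average rank can be computed as follows (a sketch using SciPy; the score matrix below is a placeholder, not our actual results):

```python
import numpy as np
from scipy.stats import rankdata

# Rows: datasets, columns: approaches; entries: hypothetical F-measures.
scores = np.array([[0.90, 0.88, 0.90],
                   [0.75, 0.80, 0.70],
                   [0.95, 0.95, 0.93]])

# Rank per dataset: higher F-measure = better = lower rank; ties get the mean rank.
# (For runtimes, rank the values directly, since lower is better.)
ranks = rankdata(-scores, axis=1, method="average")
print(ranks.mean(axis=0))   # average rank per approach
```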

4.1 Parameter Discovery

To check whether our approach to choosing parameter settings is better than that followed by other decision-tree-based approaches, we ran our Gini approach combined with Global F-measure pruning (G) and error estimate pruning (E), setting \(\lambda \) to values between 0.05 and 0.8. Our results are displayed in Table 3. In our experiments, \(\lambda =0.4\) achieves the best rank and is therefore an appropriate setting for our algorithm. We hence selected this value as default.

Table 3. F-measure for 10-fold cross-validation, averaged over 10 results. UP indicates choosing the upper value as split point, while MP uses the middle between two similarity values as split point in the node. The best, second-best and third-best values in a row are highlighted using colored cells of decreasing strength.

We then compared our approach to setting thresholds in split points with that implemented by classical decision-tree-learning approaches such as J48 (see the two right-most columns of Table 3). Note that we used \(\lambda =0.05\) to test whether a higher threshold for calculating the initial similarity values improves the quality of the constructed decision tree. While choosing the middle between similarity values as split point seems to be beneficial on some datasets (e.g., Abtbuy), it is clear that choosing the upper value yields better results overall. A comparison of the difference in F-measure achieved by the settings \(\lambda =0.05\) (first column in Table 3) and \(\lambda =0.4\) (third column in Table 3) further reveals that preventing a decision tree from learning measures with a low threshold value can improve the quality by up to 43% (Amazon-GP).

4.2 Comparison with Other Approaches

Our choice of the state-of-the-art link discovery algorithms to compare with Dragon was governed by the need to provide a fair evaluation. Firstly, the algorithms we tested against needed to learn link specifications explicitly, since our approach tackles the task of declarative link discovery. Secondly, we needed approaches able to perform supervised learning, since our approach would have an unfair advantage over unsupervised classifiers. We hence chose Wombat, since it fits these requirements and was shown to perform as well as [13] w.r.t. the F-measure it achieves. In addition, Wombat was shown to be robust w.r.t. the number of examples used for learning. We also ran our experiments with Eagle [19] and Euclid [20]. We selected Eagle [19] because it was shown to outperform MARLIN [2] and FEBRL [4] significantly in terms of runtime, while achieving comparable F-measure [19]. Euclid's major advantage over Eagle is the fact that it is deterministic while reaching an F-measure similar to Eagle's. In our evaluation, we use the supervised linear version of Euclid. All the chosen approaches, and especially Dragon, are contained in the open-source LIMES link discovery framework and are free to use. We decided not to compare Dragon with classifiers from, e.g., SILK [28], since LIMES has been shown [18] to be significantly faster than SILK and runtime comparisons would therefore not be fair. We also compare our approaches with J48, an often-used implementation of the C4.5 algorithm [10].

Table 4. Averaged results for 10-fold cross validation. We highlight the best, second- and third-best value in a row using colored cells of decreasing strength.

To test our hypothesis that common decision tree approaches choose threshold values that are too low, we also implemented J48opt. In this approach, we took the decision tree we obtained from J48, parsed it into an LS, raised all its thresholds by \(\delta \), and calculated the F-measure it achieved. We repeated this process until all thresholds were equal to 1 and kept the LS with the threshold setting that resulted in the highest F-measure. In our experiments, we set \(\delta \) to 0.05. For Dragon, we tested the Global F-measure (in the following referred to as Dragon \(_{GL}\)) and the Gini Index (Dragon \(_{GI}\)), as well as the two pruning algorithms, Global F-measure pruning and error estimate pruning. We indicate the pruning method in the subscript as well. Hence, Dragon \(_{GL \cdot E}\) is a configuration of Dragon which uses the Global F-measure for finding split attributes and error estimate pruning. In contrast, Dragon \(_{GL \cdot G}\) is Dragon with the global F-measure and global F-measure pruning. We set the maximum tree height to 3. For the Gini configurations, we set \(\lambda \) to 0.4; for the global F-measure configurations, we set \(\lambda \) to 0.6. We discovered through empirical tests that each approach performs best with these parameters. The termination criterion for Wombat was either finding a link specification with an F-measure of 1 or reaching a refinement depth of 10. The coverage threshold was set to 0.6 and the similarity measures used were the same as in Dragon: jaccard, trigrams, cosine and qgrams. Eagle was configured to run 100 generations. The mutation and crossover rates were set to 0.6 as in [19]. Euclid was set to its default parameters. For J48, we discovered that using reduced error pruning with 5 folds delivered the best results overall; otherwise, we used the default parameters found in the Weka framework [10].

To determine significant differences between classifier performances, a pair-wise Wilcoxon signed-ranks test is usually performed. However, this would lead to a multiple-testing problem in our case, since we compare more than two classifiers. We therefore follow the recommendations given in [7] and use a Friedman test to determine whether the average ranks of the algorithms differ significantly and, if this is the case, perform a Nemenyi test to compare all classifiers. The results are presented in Table 4. Concerning the F-measure, we find significant differences between the algorithms (Friedman p-value = 0.009). We can see that Wombat and Dragon \(_{GI \cdot G}\) produce the best results regarding the F-measure, with no significant difference in performance between them (Nemenyi p-value = 0.999). Wombat achieves a higher F-measure than Dragon \(_{GI \cdot G}\) on four datasets (Movies, Person1, DBLP-ACM, Amazon-GP) and is tied with Dragon on the Drugs dataset. Dragon outperforms Wombat on the remaining four datasets. The only significant differences in F-measure are between Wombat and J48 (Nemenyi p-value = 0.007) and between Dragon and J48 (Nemenyi p-value = 0.023). We can also see that J48opt performs better than J48, albeit not significantly (Nemenyi p-value = 0.326).
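For reference, a minimal sketch of this testing procedure (using SciPy and the scikit-posthocs package; the score matrix is a placeholder, not our actual results):

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# Rows: datasets (blocks), columns: classifiers (placeholder F-measures).
scores = np.array([[0.90, 0.88, 0.80],
                   [0.75, 0.80, 0.65],
                   [0.95, 0.95, 0.85],
                   [0.85, 0.83, 0.70]])

# Friedman test on the average ranks of the classifiers.
stat, p = friedmanchisquare(*scores.T)   # one sample per classifier
print(f"Friedman p-value: {p:.3f}")
if p < 0.05:
    # Pair-wise Nemenyi post-hoc test on the same blocks-by-groups matrix.
    print(sp.posthoc_nemenyi_friedman(scores))
```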

In Table 4b, we present the average runtimes of the approaches. There is a significant difference in runtime efficiency between the algorithms (Friedman p-value = \(1.127 \times 10^{-8}\)). It is evident that Dragon (specifically Dragon \(_{GI}\)) and J48 are the fastest approaches; we determined no significant difference between them (Nemenyi p-value = 0.999). The Dragon \(_{GI}\) configurations of our approach are on average 18 times faster than Wombat (Footnote 3), while Dragon \(_{GL}\) is roughly as efficient as Wombat. The performance advantage of Dragon \(_{GI}\) is due to the fact that, after calculating the initial similarity values of the entity pairs, it does not require the costly computation of mappings (Footnote 4). Overall, our results suggest that Dragon performs as well as the state of the art w.r.t. the quality of the LSs it generates, but clearly outperforms the state of the art w.r.t. its runtime (Footnote 5), making it more conducive to practical application.

4.3 Efficiency of Pruning

Since pruning is an important factor in decision tree learning, we recorded the effect pruning had in the second experiment. The results are displayed in Fig. 2.

First, note that the configuration Dragon \(_{GL \cdot G}\) was omitted in the figures, since the unpruned and pruned trees were identical. This is not surprising, because the measure used for building the tree is the same as the one used for pruning it; any leaves that would be pruned simply do not appear in the tree in the first place. For the other configurations, we can observe that pruning has, on average, a positive impact on the F-measure. The exceptions are datasets for which LSs of size 1 are learned in the first place (such as Movies and Person2 for Dragon \(_{GI}\)) and Amazon-GP, where pruning has a negative effect.

Fig. 2. Percentage change between unpruned and pruned trees in the averaged results of the ten-fold cross-validation.

5 Related Work

A plethora of approaches for link discovery has been developed recently. We give a brief overview of existing approaches and refer the reader to the corresponding surveys and comparisons [6, 15, 26] for further details. A popular link discovery framework is SILK [28], which uses multi-dimensional blocking to achieve lossless link discovery. Another lossless framework is LIMES [17], which combines similarity measures using a set-theoretical approach. An overview and a comparison of further link discovery frameworks can be found in [15].

As we have seen in Definition 2, link discovery is a binary classification problem and can therefore be tackled with classical machine learning approaches such as support vector machines [2], artificial neural networks [16] and genetic programming [11]. [26] give an overview of these approaches and compare them. Most dedicated approaches to learning link specifications are supervised techniques implemented as batch learning or as active learning. For example, Eagle [19], COALA [21] and GenLink [11] support both active and batch learning within the genetic programming paradigm. Other approaches, such as Euclid [20] and KnoFuss [22], support unsupervised learning. Wombat [25] uses an upward refinement operator to implement positive-only learning.

Record linkage is a field closely related to link discovery, and several of its approaches use decision trees. Early work by [5] aimed at matching two databases containing customer records; the authors used manually generated training data to train a CART [3] classifier and pruned it to reduce complexity and make it more robust. [8] incorporated ID3 [23] into their record linkage toolbox TAILOR. They implemented two approaches: in the first, they manually labeled record pairs and used the comparison vectors to train a decision tree. The second approach tried to overcome the manual labeling effort by applying a clustering algorithm to the comparison vectors to obtain three clusters: matches, non-matches and potential matches. Another notable approach is Active Atlas [27], which combines active learning with the C4.5 [24] decision tree learning algorithm.

Our approach extends these approaches in several ways. By learning an operator tree, our approach is able to learn much more expressive models than linear classifiers (e.g., [2]), since we can express conjunctions and disjunctions. Furthermore, we are more expressive than boolean classifiers such as [4, 16], because we can model negations. We additionally combine the generation of expressive link specifications with a deterministic approach, providing more reliability than comparable non-deterministic approaches [11, 19, 21, 22]. Dragon differs further by relying on a decision-tree-based learning paradigm, which allows it to avoid recomputing mappings and leads to a more time-efficient approach (see the evaluation section). It shares some similarity with the state of the art by virtue of its iterative approach to the generation of link specifications (see, e.g., [25]).

6 Conclusion and Future Work

In this work, we presented Dragon, a decision tree learning approach for link discovery. We evaluated our approach on nine benchmark datasets against the three state-of-the-art approaches Wombat, Euclid and Eagle. Our approach delivers state-of-the-art performance w.r.t. the F-measure it achieves, while on average being more than 18 times faster than the state of the art. We investigated why our approach outperforms other decision tree approaches and traced this back to its optimized choice of thresholds for similarity measures. Interestingly, our Global F-measure approach is biased towards precision, while using Gini leads to a bias towards recall. In future work, we will investigate how we can use these biases for ensemble learning. We will also look into the possibility of performing active learning with Dragon, in particular with the use of incremental decision trees.