1 Introduction

Graphs are recognized as a versatile alternative to feature vectors and have thus found widespread application in pattern recognition and related fields [1, 2]. However, one drawback of graphs, when compared to feature vectors, is the significantly increased complexity of many algorithms. Consider, for instance, the algorithmic comparison of two patterns (a basic requirement for pattern recognition). Due to the homogeneous nature of feature vectors, pairwise comparisons are straightforward and can be accomplished in linear time with respect to the length of the two vectors. Yet, the same task for graphs, commonly referred to as graph matching, is much more complex, as one has to identify common parts of the graphs by considering all of their subsets of nodes. Given that there are \(O(2^n)\) subsets of nodes in a graph with n nodes, the inherent difficulty of graph matching becomes obvious.

In the last four decades a huge number of procedures for graph matching have been proposed in the literature [1, 2]. They range from spectral methods [3, 4] over graph kernels [5, 6] to reformulations of the discrete graph matching problem as an instance of a continuous optimization problem (basically by relaxing some constraints) [7]. Graph edit distance [8, 9], introduced about 30 years ago, is still one of the most flexible graph distance models available and the topic of various recent research projects.

In order to compute the graph edit distance, A*-based search techniques using some heuristics are often employed (e.g. [10]). Yet, exact graph edit distance computation based on a tree search algorithm is exponential in the number of nodes of the involved graphs. Formally, for two graphs with m and n nodes we observe a time complexity of \(O(m^n)\). This means that for large graphs the computation of the exact edit distance is intractable.

In [11] the authors of the present paper introduced an algorithmic framework for the approximation of graph edit distance. The basic idea of this approach is to reduce the difficult problem of graph edit distance to a linear sum assignment problem (LSAP), for which an arsenal of efficient (i.e. cubic time) algorithms exists [12]. In two recent papers [13, 14] the optimal algorithm for the LSAP has been replaced with a suboptimal greedy algorithm which runs in quadratic time. Due to the lower complexity of this suboptimal assignment process, a substantial speed-up of the complete approximation procedure has been observed. However, it was also reported that the distance accuracy of this extension is slightly worse than that of the original algorithm. The major contribution of the present paper is to improve the overall distance accuracy of this recent procedure by means of an elaborated transformation of the underlying cost model.

The remainder of this paper is organized as follows. Next, in Sect. 2, the computation of graph edit distance is thoroughly reviewed. In particular, it is shown how the graph edit distance problem can be reduced to a linear sum assignment problem. In Sect. 3, the transformation of the cost model into a utility model is outlined. Eventually, in Sect. 4, we empirically confirm the benefit of this transformation in a classification experiment on three graph data sets. Finally, in Sect. 5, we conclude the paper.

2 Graph Edit Distance (GED)

2.1 Exact Computation of GED

A graph g is a four-tuple \(g = (V,E,\mu ,\nu )\), where V is the finite set of nodes, \(E\subseteq V \times V\) is the set of edges, \(\mu :V\rightarrow L_V\) is the node labeling function, and \(\nu : E \rightarrow L_E\) is the edge labeling function. The labels for both nodes and edges can be given by the set of integers \(L = \{1,2,3,\ldots \}\), the vector space \(L = \mathbb {R}^n\), a set of symbolic labels \(L = \{\alpha , \beta , \gamma ,\ldots \}\), or a combination of various label alphabets from different domains. Unlabeled graphs are obtained by assigning the same (empty) label \(\varnothing \) to all nodes and edges, i.e. \(L_V = L_E = \{\varnothing \}\).
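The four-tuple definition above can be sketched directly as a data structure. This is an illustrative encoding only (the class name and the choice of dictionaries for \(\mu \) and \(\nu \) are our assumptions, not part of the formal definition):

```python
# A minimal sketch of the four-tuple g = (V, E, mu, nu) from the text.
from dataclasses import dataclass, field

@dataclass
class Graph:
    nodes: list          # V: finite set of node identifiers
    edges: set           # E subset of V x V, stored as (u, v) tuples
    mu: dict = field(default_factory=dict)   # node labeling function V -> L_V
    nu: dict = field(default_factory=dict)   # edge labeling function E -> L_E

# An unlabeled graph: every node and edge carries the same empty label.
g = Graph(nodes=[0, 1, 2],
          edges={(0, 1), (1, 2)},
          mu={v: None for v in [0, 1, 2]},
          nu={e: None for e in [(0, 1), (1, 2)]})
```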

Given two graphs, \(g_1=(V_1,E_1,\mu _1,\nu _1)\) and \(g_2=(V_2,E_2,\mu _2,\nu _2)\), the basic idea of graph edit distance (GED) [8, 9] is to transform \(g_1\) into \(g_2\) using edit operations, viz. insertions, deletions, and substitutions of both nodes and edges. The substitution of two nodes u and v is denoted by \((u \rightarrow v)\), the deletion of node u by \((u \rightarrow \varepsilon )\), and the insertion of node v by \((\varepsilon \rightarrow v)\). A set of edit operations \(\lambda (g_1,g_2) = \{e_1,\ldots , e_k \}\) that transform \(g_1\) completely into \(g_2\) is called an edit path between \(g_1\) and \(g_2\).

Note that edit operations on edges are uniquely defined by the edit operations on their adjacent nodes. That is, whether an edge (uv) is substituted with an existing edge from the other graph, deleted, or inserted actually depends on the operations performed on both adjacent nodes u and v. Thus, we define that an edit path \(\lambda (g_1,g_2)\) explicitly contains the edit operations between the graphs’ nodes \(V_1\) and \(V_2\), while the edge edit operations are implicitly given by these node edit operations.

A cost function that measures the strength of an edit operation is commonly introduced for graph edit distance. The edit distance between two graphs \(g_1\) and \(g_2\) is then defined by the sum of cost of the minimum cost edit path \(\lambda _{\min }\) between \(g_1\) and \(g_2\). In fact, the problem of finding the minimum cost edit path \(\lambda _{\min }\) between \(g_1\) and \(g_2\) can be reformulated as a quadratic assignment problem (QAP). Roughly speaking, QAPs deal with the problem of assigning n entities of a first set \(S = \{s_1, \ldots , s_n \}\) to n entities of a second set \(Q = \{q_1, \ldots , q_n \}\) under some (computationally demanding) side constraints. A common way to formally represent assignments between the entities of S and Q is given by means of permutations \((\varphi _1, \ldots , \varphi _n) \) of the integers \((1,2, \ldots , n)\). A permutation \((\varphi _1, \ldots , \varphi _n)\) refers to the assignment where the first entity \(s_1 \in S\) is mapped to entity \(q_{\varphi _1} \in Q\), the second entity \(s_2 \in S\) is assigned to entity \(q_{\varphi _2} \in Q\), and so on.
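The reading of a permutation as an assignment can be made concrete with a small sketch (the entity names and the chosen permutation are illustrative; note the 1-based indices of the text versus Python's 0-based lists):

```python
# A permutation (phi_1, ..., phi_n) read as an assignment s_i -> q_{phi_i}.
S = ["s1", "s2", "s3"]
Q = ["q1", "q2", "q3"]
phi = (2, 3, 1)  # 1-based, as in the text: s1 -> q2, s2 -> q3, s3 -> q1

assignment = [(S[i], Q[phi[i] - 1]) for i in range(len(S))]
# assignment == [("s1", "q2"), ("s2", "q3"), ("s3", "q1")]
```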

By reformulating the graph edit distance problem to an instance of a QAP, two major issues have to be resolved. First, QAPs are generally stated on sets with equal cardinality. Yet, in case of graph edit distance the elements to be assigned to each other are given by the sets of nodes (and edges) with unequal cardinality in general. Second, solutions to QAPs refer to assignments of elements in which every element of the first set is assigned to exactly one element of the second set and vice versa (i.e. a solution to a QAP corresponds to a bijective assignment of the underlying entities). Yet, GED is a more general assignment problem as it explicitly allows both deletions and insertions to occur on the basic entities (rather than only substitutions).

These two issues can be simultaneously resolved by adding an appropriate number of empty “nodes” \(\varepsilon \) to both graphs \(g_1\) and \(g_2\). Formally, assume that \(|V_1| = n\) and \(|V_2|=m\), we extend \(V_1\) and \(V_2\) according to

$$V_1^{+} = V_1 \cup \overbrace{\{\varepsilon _1 ,\ldots , \varepsilon _m\} }^{\textit{m empty nodes}} ~~~\text {and}~~~V_2^{+} = V_2 \cup \underbrace{\{\varepsilon _1 ,\ldots , \varepsilon _n\} }_{\textit{n empty nodes}}. $$

Since both graphs \(g_1\) and \(g_2\) now have an equal number of nodes, viz. \((n+m)\), their corresponding adjacency matrices \(\mathbf {A}\) and \(\mathbf {B}\) are also of equal dimension. These adjacency matrices of \(g_1\) and \(g_2\) are defined by

$$\mathbf {A} = (a_{ij})_{(n+m)\times (n+m)} \text { with } a_{ij} = {\left\{ \begin{array}{ll} (u_i,u_j) &{} \text {if } (u_i,u_j) \in E_1\\ \varepsilon &{} \text {otherwise} \end{array}\right. }~~~\text {and}~~~ \mathbf {B} = (b_{ij})_{(n+m)\times (n+m)} \text { with } b_{ij} = {\left\{ \begin{array}{ll} (v_i,v_j) &{} \text {if } (v_i,v_j) \in E_2\\ \varepsilon &{} \text {otherwise.} \end{array}\right. } \quad (1)$$

If there actually is an edge between the nodes \(u_i \in V_1\) and \(u_j \in V_1\), entry \(a_{ij}\) refers to this edge \((u_i,u_j) \in E_1\), and otherwise to the empty “edge” \(\varepsilon \). Note that there cannot be any edge from an existing node in \(V_1\) to an empty node \(\varepsilon \), and thus the corresponding entries \(a_{ij}\in \mathbf {A}\) with \(i>n\) and/or \(j>n\) are also empty. The same observations apply to the entries \(b_{ij}\) in \(\mathbf {B}\).
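The padding of the adjacency matrix can be sketched as follows (the function name and the string `"eps"` standing in for \(\varepsilon \) are illustrative choices):

```python
# Sketch of the padded (n+m) x (n+m) adjacency matrix A of Eq. 1.
def padded_adjacency(edges, n, m):
    size = n + m
    A = [["eps"] * size for _ in range(size)]   # empty "edge" everywhere
    for (i, j) in edges:                        # 0-based node indices of g1
        A[i][j] = (i, j)                        # entry refers to edge (u_i, u_j)
    return A

A = padded_adjacency({(0, 1)}, n=2, m=2)
# Entries with i >= n or j >= n stay "eps": no edge can connect an
# existing node to an empty node epsilon.
```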

Next, based on the extended node sets \(V^+_1 \text { and } V^+_2\) of \(g_1\) and \(g_2\), respectively, a cost matrix \(\mathbf {C}\) can be established as follows.

$$\mathbf {C} = \left[ \begin{array}{cccc|cccc} c_{11} &{} c_{12} &{} \cdots &{} c_{1m} &{} c_{1\varepsilon } &{} \infty &{} \cdots &{} \infty \\ c_{21} &{} c_{22} &{} \cdots &{} c_{2m} &{} \infty &{} c_{2\varepsilon } &{} \ddots &{} \vdots \\ \vdots &{} \vdots &{} \ddots &{} \vdots &{} \vdots &{} \ddots &{} \ddots &{} \infty \\ c_{n1} &{} c_{n2} &{} \cdots &{} c_{nm} &{} \infty &{} \cdots &{} \infty &{} c_{n\varepsilon }\\ \hline c_{\varepsilon 1} &{} \infty &{} \cdots &{} \infty &{} 0 &{} 0 &{} \cdots &{} 0\\ \infty &{} c_{\varepsilon 2} &{} \ddots &{} \vdots &{} 0 &{} 0 &{} \ddots &{} \vdots \\ \vdots &{} \ddots &{} \ddots &{} \infty &{} \vdots &{} \ddots &{} \ddots &{} 0\\ \infty &{} \cdots &{} \infty &{} c_{\varepsilon m} &{} 0 &{} \cdots &{} 0 &{} 0 \end{array}\right] \quad (2)$$

Entry \(c_{ij}\) thereby denotes the cost \(c(u_i \rightarrow v_j)\) of the node substitution \((u_i \rightarrow v_j)\), \(c_{i \varepsilon }\) denotes the cost \(c(u_i \rightarrow \varepsilon )\) of the node deletion \((u_i \rightarrow \varepsilon )\), and \(c_{\varepsilon j}\) denotes the cost \(c(\varepsilon \rightarrow v_j)\) of the node insertion \((\varepsilon \rightarrow v_j)\). Obviously, the left upper part of the cost matrix represents the costs of all possible node substitutions, the right upper part the costs of all possible node deletions, and the bottom left part the costs of all possible node insertions. The bottom right part of the cost matrix is set to zero since substitutions of the form \((\varepsilon \rightarrow \varepsilon )\) should not cause any cost.
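The block structure just described can be sketched in code. The unit deletion/insertion costs and the substitution function below are illustrative assumptions, not the paper's cost model:

```python
import math

# Sketch of the (n+m) x (n+m) cost matrix C of Eq. 2 under a simple
# unit-cost model (cost values here are illustrative assumptions).
def build_cost_matrix(n, m, sub, c_del=1.0, c_ins=1.0):
    size = n + m
    C = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:
                C[i][j] = sub(i, j)        # upper left: substitutions c(u_i -> v_j)
            elif i < n:
                # upper right: deletions on the diagonal, infinity elsewhere
                C[i][j] = c_del if j - m == i else math.inf
            elif j < m:
                # lower left: insertions on the diagonal, infinity elsewhere
                C[i][j] = c_ins if i - n == j else math.inf
            else:
                C[i][j] = 0.0              # lower right: (eps -> eps) is free
    return C

C = build_cost_matrix(2, 2, sub=lambda i, j: abs(i - j))
```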

Given the adjacency matrices \(\mathbf {A}\) and \(\mathbf {B}\) as well as the cost matrix \(\mathbf {C}\) (Eqs. 1 and 2), the following optimization problem can now be stated.

$$\begin{aligned} (\varphi _1, \ldots , \varphi _{(n+m)}) = \underset{(\varphi _1, \ldots , \varphi _{(n+m)})\in \mathcal {S}_{(n+m)}}{\arg \min }\left[ \sum _{i=1}^{n+m} c_{i\varphi _i} + \sum _{i=1}^{n+m}\sum _{j=1}^{n+m} c (a_{ij} \rightarrow b_{\varphi _i \varphi _j})\right] , \end{aligned}$$

where \(\mathcal {S}_{(n+m)}\) refers to the set of all \((n+m)!\) possible permutations of the integers \(1, 2, \ldots , (n+m)\). Note that this optimal permutation \((\varphi _1, \ldots , \varphi _{(n+m)})\) (as well as any other valid permutation) corresponds to a bijective assignment

$$\lambda = \{(u_{1}\rightarrow v_{\varphi _1}), (u_{2}\rightarrow v_{\varphi _2}), \ldots , (u_{n+m}\rightarrow v_{\varphi _{n+m}})\}$$

of the extended node set \(V_1^{+}\) of \(g_1\) to the extended node set \(V_2^{+}\) of \(g_2\). That is, assignment \(\lambda \) includes node edit operations of the form \((u_i \rightarrow v_j)\), \((u_i \rightarrow \varepsilon )\), \((\varepsilon \rightarrow v_j)\), and \((\varepsilon \rightarrow \varepsilon )\) (the latter can be dismissed, of course). In other words, an arbitrary permutation \((\varphi _1, \ldots , \varphi _{(n+m)})\) perfectly corresponds to a valid edit path \(\lambda \) between two graphs.

The optimization problem stated above exactly corresponds to a standard QAP. Note that the linear term \(\sum _{i=1}^{n+m} c_{i\varphi _i}\) refers to the sum of cost of all node edit operations, which are defined by the permutation \((\varphi _1, \ldots , \varphi _{n+m})\). The quadratic term \(\sum _{i=1}^{n+m}\sum _{j=1}^{n+m} c (a_{ij} \rightarrow b_{\varphi _i \varphi _j})\) refers to the implied edge edit cost defined by the node edit operations. That is, since node \(u_i \in V_1^{+}\) is assigned to a node \(v_{\varphi _i} \in V_2^{+}\) and node \(u_j \in V_1^{+}\) is assigned to a node \(v_{\varphi _j} \in V_2^{+}\), the edge \((u_i,u_j) \in E_1 \cup \{\varepsilon \}\) (stored in \(a_{ij} \in \mathbf {A}\)) has to be assigned to the edge \((v_{\varphi _i},v_{\varphi _j}) \in E_2 \cup \{\varepsilon \}\) (stored in \(b_{\varphi _i \varphi _j} \in \mathbf {B}\)).
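For very small graphs, the QAP objective can be evaluated by brute force over all \((n+m)!\) permutations. This sketch exists only to make the objective concrete (the exponential blow-up is exactly why the paper approximates); the helper names and the structural edge-cost function are our assumptions:

```python
from itertools import permutations

# Brute-force minimization of the QAP objective: linear node-edit term
# plus quadratic implied edge-edit term. Exponential in n+m, as warned.
def exact_qap(C, A, B, edge_cost):
    size = len(C)
    best = float("inf")
    for phi in permutations(range(size)):
        lin = sum(C[i][phi[i]] for i in range(size))
        quad = sum(edge_cost(A[i][j], B[phi[i]][phi[j]])
                   for i in range(size) for j in range(size))
        best = min(best, lin + quad)
    return best

# Illustrative structural cost: free when both entries are edges or both empty.
cost = lambda a, b: 0 if (a == "eps") == (b == "eps") else 1
```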

2.2 Approximate Computation of GED

In fact, QAPs are very hard to solve as they belong to the class of NP-hard problems. The authors of the present paper introduced an algorithmic framework which allows the approximation of graph edit distance in a substantially faster way than traditional methods [11]. The basic idea of this approach is to reduce the QAP of graph edit distance computation to an instance of a Linear Sum Assignment Problem (LSAP). LSAPs are similar to QAPs in the sense that they also formulate an assignment problem between entities. Yet, in contrast with QAPs, LSAPs optimize the permutation \((\varphi _1, \ldots , \varphi _{(n+m)}) \) with respect to the linear term \(\sum _{i=1}^{n+m} c_{i\varphi _i} \) only. That is, LSAPs consider a single cost matrix \(\mathbf {C}\) without any side constraints. For solving LSAPs a large number of efficient (i.e. polynomial) algorithms exist (see [12] for an exhaustive survey on LSAP solvers).

Yet, by omitting the quadratic term \(\sum _{i=1}^{n+m}\sum _{j=1}^{n+m} c (a_{ij} \rightarrow b_{\varphi _i \varphi _j})\) during the optimization process, we neglect the structural relationships between the nodes (i.e. the edges between the nodes). In order to integrate knowledge about the graph structure, the minimum sum of edge edit operation costs implied by the corresponding node operation can be added to each entry \(c_{ij} \in \mathbf {C}\), i.e. to the cost of each node edit operation \((u_i \rightarrow v_j)\). Formally, for every entry \(c_{ij}\) in the cost matrix \(\mathbf {C}\) one might solve an LSAP on the ingoing and outgoing edges of the nodes \(u_i\) and \(v_j\) and add the resulting cost to \(c_{ij}\). That is, we define

$$c^*_{ij} = c_{ij} + \underset{(\varphi _1, \ldots , \varphi _{(n+m)})\in \mathcal {S}_{(n+m)}}{\min }\sum _{k=1}^{n+m} \left[ c(a_{ik} \rightarrow b_{j\varphi _k}) + c(a_{ki} \rightarrow b_{\varphi _kj})\right] ,$$

where \(\mathcal {S}_{(n+m)}\) refers to the set of all \((n+m)!\) possible permutations of the integers \(1, \ldots , (n+m)\). To entry \(c_{i\varepsilon }\), which denotes the cost of a node deletion, the cost of the deletion of all incident edges of \(u_i\) can be added, and to the entry \(c_{\varepsilon j}\), which denotes the cost of a node insertion, the cost of all insertions of the incident edges of \(v_j\) can be added. We denote the cost matrix which is enriched with structural information with \(\mathbf {C}^{*} =(c^*_{ij})\) from now on.
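The enrichment step can be sketched as follows. For clarity the inner minimization is done by brute force over permutations here, whereas in practice one would solve it as a small LSAP; the function names and the structural edge-cost function are illustrative assumptions:

```python
from itertools import permutations

# Sketch of the structural enrichment C -> C*: to each node edit cost,
# add the minimum implied cost of matching the incident (in- and out-) edges.
def enrich(C, A, B, edge_cost):
    size = len(C)
    Cs = [row[:] for row in C]
    for i in range(size):
        for j in range(size):
            Cs[i][j] += min(
                sum(edge_cost(A[i][k], B[j][phi[k]]) +    # outgoing edges
                    edge_cost(A[k][i], B[phi[k]][j])      # ingoing edges
                    for k in range(size))
                for phi in permutations(range(size)))
    return Cs

# Illustrative structural cost: free when both entries are edges or both empty.
cost = lambda a, b: 0 if (a == "eps") == (b == "eps") else 1
```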

In [11] the cost matrix \(\mathbf {C^*} = (c^*_{ij})\) as defined above is employed in order to optimally solve the LSAP by means of Munkres' algorithm [15]. The LSAP optimization consists in finding a permutation \((\varphi ^*_1, \ldots , \varphi ^*_{n+m})\) of the integers \((1,2, \ldots , (n+m))\) that minimizes the overall assignment cost \(\sum _{i=1}^{(n+m)} c^*_{i\varphi ^*_i}\). Similar to the permutation \((\varphi _1, \ldots , \varphi _{n+m})\) obtained on the QAP, the permutation \((\varphi ^*_1, \ldots , \varphi ^*_{n+m})\) corresponds to a bijective assignment of the entities in \(V^+_1\) to the entities in \(V^+_2\). In other words, the permutation \((\varphi ^*_1, \ldots , \varphi ^*_{(n+m)})\) refers to an admissible and complete (yet not necessarily minimal cost) edit path between the graphs under consideration. We denote this approximation framework by BP-GED from now on.

Recently, it has been proposed to solve the LSAP stated on \(\mathbf {C}^*\) with an approximation rather than with an exact algorithm [13, 14]. This algorithm iterates over the rows of \(\mathbf {C}^*\) from top to bottom and greedily assigns each row to its minimum unused element. Clearly, the complexity of this suboptimal assignment algorithm is \(O((n+m)^2)\). For the remainder of this paper we denote the graph edit distance approximation where the LSAP on \(\mathbf {C}^*\) is solved by means of this greedy procedure by GR-GED.

3 Building the Utility Matrix

Similar to [13, 14] we aim at solving the basic LSAP in \(O(n^2)\) time in order to approximate the graph edit distance. Yet, in contrast with this previous approach, which considers the cost matrix \(\mathbf {C}^*=(c^*_{ij})\) directly as its basis, we transform the given cost matrix into a utility matrix with equal dimension as \(\mathbf {C}^*\) and work with this matrix instead.

The rationale behind this transformation is based on the following observation. When picking the minimum element \(c_{ij}\) from cost matrix \(\mathbf {C}^*\), i.e. when assigning node \(u_i\) to \(v_j\), we exclude both nodes \(u_i\) and \(v_j\) from any future assignment. However, it may happen that node \(v_j\) is not only the best choice for \(u_i\) but also for another node \(u_k\). Because \(v_j\) is no longer available, we may be forced to map \(u_k\) to another, very expensive node \(v_l\), such that the total assignment cost becomes higher than mapping node \(u_i\) to some node that is (slightly) more expensive than \(v_j\). In order to take such situations into account, we incorporate additional information in the utility matrix about the minimum and maximum value in each row and each column.

Let us consider the i-th row of the cost matrix \(\mathbf {C}^*\) and let \(\textit{row-min}_{i}\) and \(\textit{row-max}_{i}\) denote the minimum and maximum value occurring in this row, respectively. Formally, we have

$$\textit{row-min}_{i}= \min _{j=1,\ldots , (n+m)} c^*_{ij} ~~~~~\text { and }~~~ \textit{row-max}_{i} = \max _{j=1,\ldots , (n+m)} c^*_{ij}.$$

If the node edit operation \((u_i \rightarrow v_j)\) is selected, one might interpret the quantity

$$\textit{row-win}_{ij} = \frac{\textit{row-max}_{i} - c^*_{ij}}{\textit{row-max}_{i} - \textit{row-min}_{i} }$$

as a win for \((u_i \rightarrow v_j)\), when compared to the locally worst case situation where \(v_k\) with \(k = \arg \max _{j=1,\ldots , (n+m)} c^*_{ij}\) is chosen as target node for \(u_i\). Likewise, we might interpret

$$\textit{row-loss}_{ij} = \frac{c^*_{ij} - \textit{row-min}_{i}}{\textit{row-max}_{i} - \textit{row-min}_{i}}$$

as a loss for \((u_i \rightarrow v_j)\), when compared to selecting the minimum cost assignment which would be possible in this row. Note that both \(\textit{row-win}_{ij}\) and \(\textit{row-loss}_{ij}\) are normalized to the interval [0, 1]. That is, when \(c^*_{ij} = \textit{row-min}_{i}\) we have a maximum win of 1 and a minimum loss of 0. Likewise, when \(c^*_{ij} = \textit{row-max}_{i}\) we observe a minimum win of 0 and a maximum loss of 1.

Overall we define the utility of the node edit operation \((u_i \rightarrow v_j)\) with respect to row i as

$$\textit{row-utility}_{ij} = \textit{row-win}_{ij} - \textit{row-loss}_{ij} = \frac{\textit{row-max}_{i} + \textit{row-min}_{i} - 2c^*_{ij}}{\textit{row-max}_{i} - \textit{row-min}_{i}}.$$

Clearly, when \(c^*_{ij} = \textit{row-min}_{i}\) we observe a row utility of \(+1\), and vice versa, when \(c^*_{ij} = \textit{row-max}_{i}\) we have a row utility of \(-1\).

So far the utility of a node edit operation \((u_i \rightarrow v_j)\) is quantified with respect to the i-th row only. In order to take into account information about the j-th column, we determine the minimum and maximum values that occur in column j by

$$\textit{col-min}_{j}= \min _{i=1,\ldots , (n+m)} c^*_{ij} \quad \text { and }~~~ \textit{col-max}_{j} = \max _{i=1,\ldots , (n+m)} c^*_{ij}.$$

Eventually, we define

$$\textit{col-win}_{ij} = \frac{\textit{col-max}_{j} - c^*_{ij}}{\textit{col-max}_{j} - \textit{col-min}_{j} } ~~~~ \text {and} ~~~ \textit{col-loss}_{ij} = \frac{c^*_{ij} - \textit{col-min}_{j}}{\textit{col-max}_{j} - \textit{col-min}_{j}}.$$

Similarly to the utility of the node edit operation \((u_i \rightarrow v_j)\) with respect to row i we may define the utility of the same edit operation with respect to column j as

$$\textit{col-utility}_{ij} = \textit{col-win}_{ij} - \textit{col-loss}_{ij} = \frac{\textit{col-max}_{j} + \textit{col-min}_{j} - 2c^*_{ij}}{\textit{col-max}_{j} - \textit{col-min}_{j}}.$$

To finally estimate the utility \(u_{ij}\) of a node edit operation \((u_i \rightarrow v_j)\) with respect to both row i and column j we compute the sum

$$u_{ij} = \textit{row-utility}_{ij}+\textit{col-utility}_{ij}.$$

Since both \(\textit{row-utility}_{ij}\) and \(\textit{col-utility}_{ij}\) lie in the interval \([-1,1]\), we have \(u_{ij} \in [-2,2]\) for \(i,j = 1, \ldots , (n+m)\). We denote the final utility matrix by \(\mathbf {U}=(u_{ij})\).
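The complete transformation of \(\mathbf {C}^*\) into \(\mathbf {U}\) can be sketched as follows. Note that the sketch assumes no row or column of \(\mathbf {C}^*\) is constant (otherwise the normalizing denominators would vanish); the function name is our choice:

```python
# Sketch of the cost-to-utility transformation of Sect. 3: every entry of
# C* is replaced by row-utility + col-utility, each normalized to [-1, 1].
# Assumes no constant row or column (nonzero denominators).
def utility_matrix(Cs):
    size = len(Cs)
    U = [[0.0] * size for _ in range(size)]
    for i in range(size):
        rmin, rmax = min(Cs[i]), max(Cs[i])
        for j in range(size):
            col = [Cs[k][j] for k in range(size)]
            cmin, cmax = min(col), max(col)
            ru = (rmax + rmin - 2 * Cs[i][j]) / (rmax - rmin)  # row utility
            cu = (cmax + cmin - 2 * Cs[i][j]) / (cmax - cmin)  # column utility
            U[i][j] = ru + cu
    return U

U = utility_matrix([[0.0, 1.0], [1.0, 0.0]])
```

Since the greedy procedure now works on utilities rather than costs, it selects, row by row, the unused column with the highest utility.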

4 Experimental Evaluation

In the experimental evaluation we aim at investigating the benefit of using the utility matrix \(\mathbf {U}\) instead of the cost matrix \(\mathbf {C}^*\) in the framework GR-GED. In particular, we assess the quality of the different distance approximations by comparing the sums of distances and by means of a distance-based classifier, viz. a nearest-neighbor (NN) classifier. Note that there are various other approaches to graph classification that make use of graph edit distance in some form. Yet, the nearest-neighbor paradigm is particularly interesting for the present evaluation because it directly uses the distances without any additional classifier training.
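The NN paradigm used here amounts to returning the label of the training graph with the smallest approximated edit distance. A minimal sketch (the distance values and labels below are made up for illustration):

```python
# 1-NN classification over precomputed graph edit distance approximations.
def nn_classify(dists_to_train, train_labels):
    # dists_to_train[i]: approximated edit distance to the i-th training graph
    i = min(range(len(dists_to_train)), key=dists_to_train.__getitem__)
    return train_labels[i]

label = nn_classify([3.2, 0.7, 5.1], ["active", "inactive", "active"])
# label == "inactive"
```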

We use three real world data sets from the IAM graph database repository [16]. Two graph data sets involve graphs that represent molecular compounds (AIDS and MUTA). These data sets consist of two classes each, representing molecules with and without activity against HIV (AIDS), and molecules with and without the mutagen property (MUTA), respectively. The third data set consists of graphs representing proteins stemming from six different classes (PROT).

Table 1. The mean run time for one matching, the relative increase of the sum of distances compared with BP-GED, and the recognition rate (rr) of a nearest-neighbor classifier using a specific graph edit distance algorithm.

In Table 1 the results obtained with three different graph edit distance approximations are shown. The first algorithm is BP-GED(\(\mathbf {C}^*\)), which solves the LSAP on \(\mathbf {C}^*\) in an optimal manner in cubic time [11]. The second algorithm is GR-GED(\(\mathbf {C}^*\)), which solves the LSAP on \(\mathbf {C}^*\) in a greedy manner in quadratic time [13, 14]. Finally, the third algorithm is GR-GED(\(\mathbf {U}\)), which operates on the utility matrix \(\mathbf {U}\) instead of \(\mathbf {C}^*\) (also using the greedy assignment algorithm).

We first focus on the mean run time for one matching in ms and compare BP-GED with GR-GED operating on the original cost matrix \(\mathbf {C}^*\). On all data sets substantial speed-ups of GR-GED(\(\mathbf {C}^*\)) can be observed. On the AIDS data set, for instance, the greedy approach GR-GED(\(\mathbf {C}^*\)) is approximately three times faster than BP-GED. On the MUTA data set the mean matching time is decreased from 33.89 ms to 4.56 ms (about seven times faster) and on the PROT data the greedy approach approximately halves the matching time (25.43 ms vs. 13.31 ms). Comparing GR-GED(\(\mathbf {C}^*\)) with GR-GED(\(\mathbf {U}\)), we observe only a small increase of the matching time when the latter approach is used. This slight increase of the run time, which is actually observable on all data sets, is due to the computational overhead necessary for transforming the cost matrix \(\mathbf {C}^*\) into the utility matrix \(\mathbf {U}\).

Next, we focus on the distance quality of the greedy approximation algorithms. Note that all of the employed algorithms return an upper bound on the true edit distance, and thus, the lower the sum of distances of a specific algorithm, the better its approximation quality. For our evaluation we take the sum of distances returned by BP-GED as reference point and measure the relative increase of the sum of distances when compared with BP-GED (sod). We observe that GR-GED(\(\mathbf {C}^*\)) increases the sum of distances by 1.92 % on the AIDS data when compared with BP-GED. On the other two data sets the sum of distances is also increased (by 1.50 % and 10.86 %, respectively). By using the utility matrix \(\mathbf {U}\) rather than the cost matrix \(\mathbf {C}^*\) in the greedy assignment algorithm, we observe smaller sums of distances on the MUTA and PROT data sets. Hence, we conclude that GR-GED(\(\mathbf {U}\)) is generally able to produce more accurate approximations than GR-GED(\(\mathbf {C}^*\)).

Finally, we focus on the recognition rate (rr) of a NN-classifier that uses the different distance approximations. We observe that the NN-classifier that is based on the distances returned by GR-GED(\(\mathbf {C}^*\)) achieves lower recognition rates than the same classifier that uses distances from BP-GED (on all data sets). This loss in recognition accuracy may be attributed to the fact that the approximations in GR-GED are coarser than those in BP-GED. Yet, our novel procedure, i.e. GR-GED(\(\mathbf {U}\)), improves the recognition accuracy on all data sets when compared to GR-GED(\(\mathbf {C}^*\)). Moreover, we observe that GR-GED(\(\mathbf {U}\)) is inferior to BP-GED in two out of three cases only.

5 Conclusions and Future Work

In the present paper we propose to use a utility matrix instead of a cost matrix for the assignment of local substructures in a graph. The motivation for this transformation is based on the greedy behavior of the basic assignment algorithm. More formally, with the transformation of the cost matrix into a utility matrix we aim at increasing the probability of selecting a correct node edit operation during the optimization process. With an experimental evaluation on three real world data sets, we empirically confirm that our novel approach is able to increase the accuracy of a distance based classifier, while the run time remains nearly unaffected.

In future work we aim at testing other (greedy) assignment algorithms on the utility matrix \(\mathbf {U}\). Moreover, there seems to be room for developing and investigating variants of the utility matrix that integrate additional information about the trade-off between wins and losses of individual assignments.