1 Introduction

The nature of real-world problems has driven their modeling as structured objects in which samples are usually represented by nodes and their relationships by connections (edges), as in a graph [1, 2]. Among these problems, one can mention studies in bioinformatics, social networks, and document mining, just to name a few, in which it is necessary to identify structures or elements that share similar properties.

Another very common characteristic of real-world problems is the non-linear distribution of the data, which requires complex models to identify the different patterns present in the dataset [3]. In this context, kernels have been proposed as a tool to overcome this issue by providing a way to map the data into a higher-dimensional space, where the input data may become linearly separable.

Support Vector Machines (SVM) are perhaps the most well-known technique that makes use of kernels to embed data into higher-dimensional feature spaces [4]. There are also kernel-PCA (Principal Component Analysis) [5] and kernel-Fisher discriminant analysis [6], just to mention a few. Although the mapping can take the feature space to a very high dimension, which increases the computational cost, both training and classification can be performed by means of a dot product, the so-called kernel trick. Such a procedure consists of computing the dot product without explicitly mapping the samples to the higher-dimensional space. The combination of these two worlds, i.e., graph-based representations and non-linear distributions, has motivated studies on the application of kernels to structured data [1, 7]. In this research area, two different approaches are typically found: applications that make use of kernels to measure similarity among graphs, and others that focus on the similarity among the nodes of a graph.
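To make the kernel trick concrete, the toy sketch below (an illustration added for clarity, not part of the original work) shows that the homogeneous polynomial kernel \(k(\mathbf{x},\mathbf{y})=(\mathbf{x}\cdot \mathbf{y})^2\) returns exactly the dot product of an explicit quadratic feature map, so the inner product in the higher-dimensional space is obtained without ever building that space.

```python
import numpy as np

def phi(x):
    # Explicit quadratic feature map for 2-D inputs: (x1^2, sqrt(2)*x1*x2, x2^2).
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
explicit = phi(x) @ phi(y)   # dot product in the explicitly mapped space
implicit = (x @ y) ** 2      # kernel trick: same value, no mapping performed
assert np.isclose(explicit, implicit)
```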

Some of the first works on kernels between graphs were proposed by Gärtner et al. [8] and Borgwardt et al. [9] using random walks, and by Tsuda et al. [10], Kashima et al. [11, 12] and Mahé et al. [13] using marginalized kernels. The basic idea of random-walk graph kernels is to perform a random walk on a pair of graphs and count the number of matching walks [2] on both graphs. The marginalized kernel is defined as the inner product of the count vectors averaged over all possible label paths [11]. Regarding kernels on graphs, interesting works were proposed by Kondor and Lafferty [1] and Smola and Kondor [14]. In [1], the authors proposed diffusion kernels, a special class of exponential kernels based on the heat equation.

Graph-based machine learning techniques can be found as well. The Optimum-Path Forest (OPF) framework [15,16,17,18] is a graph-based classification algorithm in which samples are represented as graph nodes, each node can also be weighted by a probability density function that takes into account its k-nearest neighbors, and the edges are weighted by a distance function. The partitioning performed by the OPF algorithm is driven by a competition process, in which prototypes try to conquer the remaining samples by offering them paths with optimum cost. The result is a set of trees (a forest) rooted at the prototypes, where each tree may represent a single class or cluster. The OPF algorithm has been applied to a large variety of applications, such as network security [19], Parkinson's disease identification [20], clustering [21, 22], and biometrics [23, 24], just to name a few, and it has shown competitive results against state-of-the-art and well-known machine learning algorithms.

The current OPF algorithm implementation naturally handles non-linear situations, but it does not map samples from one space to another. In this paper, we take one step further by modifying the OPF algorithm to work with kernels on graphs in order to improve its training and classification results, since such an approach has not been applied so far. The performance of the proposed approach is assessed under three different kernels, and it is compared against the original OPF algorithm and the well-known SVM on 11 different datasets. The remainder of this paper is organized as follows. Sections 2 and 3 present the theoretical background concerning OPF and its kernel-based variant, respectively. Section 4 discusses the methodology and experiments, and Sect. 5 states the conclusions.

2 Optimum-Path Forest

In this section, we explain the OPF working mechanism considering the first proposed version [16, 17]. Roughly speaking, the OPF classifier models the problem of pattern recognition as a graph partition in a given feature space. The nodes are represented by the feature vectors and the edges connect all pairs of them, defining a fully connected graph. The partition of the graph is performed through a competition process among some key samples (prototypes), which offer optimum paths to the remaining nodes of the graph. Each prototype sample defines its own optimum-path tree (OPT), and the collection of all OPTs defines an optimum-path forest, which gives the name to the classifier.

Let \(\mathcal{Z}=\mathcal{Z}_1\cup \mathcal{Z}_2\) be a dataset labeled with a function \(\lambda \), in which \(\mathcal{Z}_1\) and \(\mathcal{Z}_2\) stand for the training and test sets, respectively, such that \(\mathcal{Z}_1\) is used to train a given classifier and \(\mathcal{Z}_2\) is used to assess its accuracy. Let \(\mathcal{S}\subseteq \mathcal{Z}_1\) be a set of prototype samples. Essentially, the OPF classifier creates a discrete optimal partition of the feature space such that any sample \(\mathbf s \in \mathcal{Z}_2\) can be classified according to this partition.

The OPF algorithm may be used with any smooth path-cost function which can group samples with similar properties [25]. Papa et al. [16, 17] employed the path-cost function \(f_{max}\), which is computed as follows:

$$\begin{aligned} f_{max}(\langle \mathbf{s} \rangle ) &= \begin{cases} 0 & \text{if } \mathbf{s} \in \mathcal{S}, \\ +\infty & \text{otherwise}, \end{cases} \\ f_{max}(\pi \cdot \langle \mathbf{s}, \mathbf{t} \rangle ) &= \max \{f_{max}(\pi ), d(\mathbf{s}, \mathbf{t})\}, \end{aligned}$$
(1)

in which \(d(\mathbf s ,\mathbf t )\) denotes the distance between samples \(\mathbf s \) and \(\mathbf t \), and a path \(\pi \) is defined as a sequence of adjacent samples. As such, we have that \(f_{max}(\pi )\) computes the maximum distance among adjacent samples in \(\pi \), when \(\pi \) is not a trivial path.
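For illustration (a minimal sketch, not LibOPF code), the function below evaluates \(f_{max}\) for a path given as a sequence of node indices, assuming a precomputed distance matrix D and the indices of the prototype set \(\mathcal{S}\).

```python
import numpy as np

def f_max(path, prototype_indices, D):
    """Path cost under Eq. 1 for a path given as a list of node indices."""
    # Trivial path <s>: cost 0 if s is a prototype, +infinity otherwise.
    cost = 0.0 if path[0] in prototype_indices else np.inf
    # Extending the path: the cost is the maximum arc weight along it.
    for s, t in zip(path[:-1], path[1:]):
        cost = max(cost, D[s, t])
    return cost
```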

The OPF algorithm assigns one optimum path \(P^{*}(\mathbf s )\) from \(\mathcal{S}\) to every sample \(\mathbf s \in \mathcal{Z}_1\), forming an optimum-path forest P (a function with no cycles that assigns to each \(\mathbf s \in \mathcal{Z}_1\backslash \mathcal{S}\) its predecessor \(P(\mathbf s )\) in \(P^{*}(\mathbf s )\), or a marker nil when \(\mathbf s \in \mathcal{S}\)). Let \(R(\mathbf s )\in \mathcal{S}\) be the root of \(P^{*}(\mathbf s )\) that can be reached from \(P(\mathbf s )\). The OPF algorithm computes, for each \(\mathbf s \in \mathcal{Z}_1\), the cost \(C(\mathbf s )\) of \(P^{*}(\mathbf s )\), the label \(L(\mathbf s )=\lambda (R(\mathbf s ))\), and the predecessor \(P(\mathbf s )\).

2.1 Training

In the training phase, the OPF algorithm aims to find the set \(\mathcal{S}^*\), i.e., the optimum set of prototypes, by minimizing the classification errors for every \(\mathbf s \in \mathcal{Z}_1\) through the exploitation of the theoretical relation between the minimum-spanning tree (MST) and the optimum-path tree (OPT) for \(f_{max}\) [26]. The training essentially consists of finding \(\mathcal{S}^*\) from \(\mathcal{Z}_1\) and an OPF classifier rooted at \(\mathcal{S}^*\).

By computing an MST, we obtain a connected acyclic graph whose nodes are all samples of \(\mathcal{Z}_1\) and whose arcs are undirected and weighted by the distances d between adjacent samples. The spanning tree is optimum in the sense that the sum of its arc weights is minimum when compared to any other spanning tree in the complete graph. In the MST, every pair of samples is connected by a single path, which is optimum according to \(f_{max}\). That is, the minimum-spanning tree contains one optimum-path tree for any selected root node. The optimum prototypes are the closest elements of the MST with different labels in \(\mathcal{Z}_1\) (i.e., elements that fall on the frontier between classes), as illustrated by the sketch below. After finding the prototypes, the competition process takes place to build the optimum-path forest.
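The following sketch (assuming Euclidean distances and SciPy; it is an illustration, not the LibOPF implementation) elects prototypes as the endpoints of MST arcs that connect samples with different labels.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

def find_prototypes(X, y):
    """Return the indices of samples lying on class frontiers of the MST."""
    D = cdist(X, X)                          # complete graph with Euclidean arc weights
    mst = minimum_spanning_tree(D).tocoo()   # undirected MST in sparse (COO) form
    prototypes = set()
    for i, j in zip(mst.row, mst.col):
        if y[i] != y[j]:                     # arc crossing a class frontier
            prototypes.update((i, j))
    return sorted(prototypes)
```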

2.2 Classification

For any sample \(\mathbf t \in \mathcal{Z}_2\), we consider all arcs connecting \(\mathbf t \) with samples \(\mathbf s \in \mathcal{Z}_1\), as though \(\mathbf t \) were part of the training graph. Considering all possible paths from \(\mathcal{S}^*\) to \(\mathbf t \), we find the optimum path \(P^*(\mathbf t )\) from \(\mathcal{S}^*\) and label \(\mathbf t \) with the class \(\lambda (R(\mathbf t ))\) of its most strongly connected prototype \(R(\mathbf t )\in \mathcal{S}^*\). This path can be identified incrementally by evaluating the optimum cost \(C(\mathbf t )\) as

$$\begin{aligned} C(\mathbf{t}) = \min_{\forall \mathbf{s} \in \mathcal{Z}_1}\{ \max \{C(\mathbf{s}), d(\mathbf{s}, \mathbf{t})\}\}. \end{aligned}$$
(2)

Let the node \(\mathbf s ^*\in \mathcal{Z}_1\) be the one that satisfies Eq. 2 (i.e., the predecessor \(P(\mathbf t )\) in the optimum path \(P^*(\mathbf t )\)). Given that \(L(\mathbf s ^*)=\lambda (R(\mathbf t ))\), the classification simply assigns \(L(\mathbf s ^*)\) as the class of \(\mathbf t \). An error occurs when \(L(\mathbf s ^*)\ne \lambda (\mathbf t )\).
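As an illustration, a minimal sketch of this classification rule is given below, assuming that the training costs C and labels L were produced by the training phase and that d is the Euclidean distance (names are illustrative, not the LibOPF API).

```python
import numpy as np

def opf_classify(t, X_train, C, L):
    """Label a test sample t via Eq. 2, given training samples, costs C, and labels L."""
    d = np.linalg.norm(X_train - t, axis=1)   # arcs connecting t to every training sample
    offered = np.maximum(C, d)                # cost offered to t by each training sample
    s_star = int(np.argmin(offered))          # predecessor P(t) that minimizes Eq. 2
    return L[s_star], offered[s_star]         # predicted label and optimum cost C(t)
```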

3 Proposed Approach

The proposed kernel-based OPF, hereinafter called kOPF, works similarly to the SVM algorithm, in which samples are mapped into a feature space of higher dimension. In the SVM context, such mapping is performed as an attempt to make the data linearly separable. The OPF, on the other hand, naturally works with non-linear data. Therefore, the main idea of this work is to evaluate OPF's behavior under the assumption of sample separability in higher-dimensional spaces.

Let \(\varPhi (\cdot ,\cdot )\) be a kernel function that generates a new dataset \(\mathcal{X} = \varPhi (\mathcal{Z})\). Given a sample \(\mathbf p \in \mathcal{Z}\), such that \(\mathbf p \in \mathfrak {R}^n\), its new representation \(\hat{\mathbf{p }} \in \mathcal{X}\) is defined as follows:

$$\begin{aligned} \hat{\mathbf{p }} = (\varPhi _1, \varPhi _2, \ldots , \varPhi _{|\mathcal{Z}_1|}), \end{aligned}$$
(3)

where \(\varPhi _i=\varPhi (\mathbf p ,\mathbf s _i)\), with \(\mathbf s _i \in \mathcal{Z}_{1}\). Notice that \(\hat{\mathbf{p }} \in \mathfrak {R}^{|\mathcal{Z}_1|}\), which means the new representation of sample \(\mathbf p \) has as many dimensions as the number of training samples.

In short, \(\varPhi (\mathbf p , \mathbf s _i)\) makes use of a distance function (e.g., Euclidean, Mahalanobis, among others) to compute a term that replaces the norm in kernel functions such as the Radial Basis Function (RBF) and the Sigmoid, for instance. This term corresponds to the distance between the sample to be mapped \(\mathbf p \) and a training sample \(\mathbf s \). Basically, the mapping performed by kOPF is carried out by computing a feature vector in which each component is the kernel function applied to the distance between the sample to be mapped (either a training or a testing sample) and a different training sample. It is important to highlight that large training sets may cause a significant increase in the size of the new feature vector, since this size depends on the number of training samples \(|\mathcal{Z}_1|\).
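A minimal sketch of this mapping (Eq. 3) is shown below, assuming Euclidean distances and a kernel given as a function of the distance; it is an illustration of the idea rather than the actual implementation over LibOPF.

```python
import numpy as np

def kopf_map(p, Z1, kernel):
    """Map a sample p (training or test) to its |Z1|-dimensional representation
    of Eq. 3: one kernel value per training sample."""
    return np.array([kernel(np.linalg.norm(p - s)) for s in Z1])

def kopf_map_dataset(Z, Z1, kernel):
    # Every sample of Z becomes a |Z1|-dimensional feature vector.
    return np.array([kopf_map(p, Z1, kernel) for p in Z])
```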

4 Methodology and Experiments

The kOPF classifier has both its performance and accuracy assessed by means of 11 public benchmarking datasets that provide different classification scenarios. The implementation of our proposed approach is developed on top of LibOPF [27], with the standard OPF and SVM used as baselines for the experiments. With respect to SVM, we used the well-known LibSVM. Table 1 presents detailed information about the datasets.

Table 1. Information about the datasets used in the experiments.

Since the kernel function can influence the final accuracy, we evaluated its impact by applying three different kernel functions to kOPF, as follows:

  • Identity: \(\varPhi (\mathbf p , \mathbf s ) = \Vert \mathbf p , \mathbf s \Vert \)

  • RBF: \(\varPhi (\mathbf p , \mathbf s ) = e^{-(\gamma \Vert \mathbf p , \mathbf s \Vert ^2)}\)

  • Sigmoid: \(\varPhi (\mathbf p , \mathbf s ) = \tanh (\gamma \Vert \mathbf p , \mathbf s \Vert +C)\)

where \(\Vert \mathbf p , \mathbf s \Vert \) denotes the Euclidean distance. Notice that the Identity kernel is parameterless. The SVM was also evaluated using three different kernel functions: linear, RBF, and Sigmoid. In situations where the parameters C and \(\gamma \) are required, they are optimized over the intervals \(C = [-32, 32]\) and \(\gamma = [0, 32]\) with a step of 2 for both.
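Written as code, the three kernels above take the following form (a sketch over the Euclidean distance \(r = \Vert \mathbf p , \mathbf s \Vert \); the default parameter values are mere placeholders for those selected by the grid search described above). Any of these functions can be plugged into the mapping sketch of Sect. 3.

```python
import numpy as np

def identity(r):                    # parameterless
    return r

def rbf(r, gamma=1.0):              # gamma chosen by the grid search
    return np.exp(-gamma * r ** 2)

def sigmoid(r, gamma=1.0, C=0.0):   # gamma and C chosen by the grid search
    return np.tanh(gamma * r + C)
```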

The classification experiments were conducted by means of a hold-out procedure with 15 runs, in which both training and testing sets were randomly generated in each run, each containing 50% of the dataset samples. The experiments also evaluated the impact on the accuracy rate when features are normalized. Tables 2 and 3 present the mean accuracy results considering non-normalized and normalized datasets, respectively. The accuracy rates were computed using the measure proposed by Papa et al. [16], which takes unbalanced data into account. The best results according to the Wilcoxon signed-rank test with significance 0.05 are shown in bold.
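For reproducibility, the protocol can be summarized by the sketch below, where toy data and scikit-learn's SVC merely stand in for the actual datasets and the LibSVM/LibOPF implementations, and plain accuracy stands in for the unbalanced-data measure of [16].

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)  # toy stand-in
scores = []
for run in range(15):                                    # 15 independent hold-out runs
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=run)
    model = SVC(kernel="rbf").fit(X_tr, y_tr)            # stand-in for OPF/kOPF/SVM
    scores.append(model.score(X_te, y_te))
print(f"mean accuracy over 15 runs: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```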

Table 2. Mean classification rates for non-normalized features.
Table 3. Mean classification rates for normalized features.

In the non-normalized feature scenario, SVM-RBF achieved the best (or statistically similar) results in 9 out of 11 datasets, followed by kOPF (kOPF-Identity and kOPF-RBF) with 7 out of 11, with kOPF-Identity being the best in 6 out of 11. The standard OPF obtained the best results in 5 out of 11 datasets. Considering normalized features, kOPF (kOPF-Identity, kOPF-RBF and kOPF-Sigmoid) obtained the best (or statistically similar) results in 8 out of 11 datasets, with kOPF-Identity being the best in 5 out of 11, followed by SVM (SVM-Linear, SVM-RBF and SVM-Sigmoid) with 7 out of 11, with SVM-RBF being the best in 6 out of 11. The OPF obtained the best results in only 4 out of 11 datasets.

In both scenarios (non-normalized and normalized), the proposed kOPF outperformed the traditional OPF in most datasets, and for normalized features kOPF outperformed SVM in some datasets as well. These results are quite interesting, since kOPF was able to improve on OPF and to outperform SVM in some datasets. For the remaining datasets, although kOPF did not outperform OPF and SVM, its results were considerably close.

The experiments also comprised an analysis of the computational load required by each technique on each dataset. The results showed that OPF and kOPF require a considerably smaller computational load in the training phase when compared against SVM. The high training time makes SVM prohibitive in real-time learning systems, especially if the training set is very dynamic over time; in this situation, both OPF and kOPF seem to be the most suitable approaches. Due to lack of space, it was not possible to include the computational load results of each technique, but the OPF-based approaches were around 200 times faster than SVM for training on the larger datasets.

5 Conclusions

This paper introduced a kernel-based OPF, which is a modification of the standard OPF classifier that allows the usage of different kernel functions for both learning and classification. In our proposed approach, the mapping makes use of a distance metric whose results are applied to kernel functions, such as RBF and Sigmoid. The main goal of such a modification is to improve the accuracy rate.

The evaluation using 11 benchmark datasets and three different kernels showed that the proposed approach achieved very interesting results, in which the application of kernel functions improved the accuracy rate of the traditional OPF and even outperformed the well-known SVM when features were normalized. In summary, kOPF achieved satisfactory results and is an interesting option for classification, especially when training sets are very dynamic over time, due to its low computational load for training purposes.