
1 Introduction

Non-linear support vector machines (SVMs) are a powerful family of classifiers. However, while recent years have seen considerable advances in scaling kernel SVMs to large-scale problems [1, 9], the lack of sparsity in the obtained models, i.e., the often large number of support vectors, remains an issue in contexts where the run-time complexity of the classifier is a critical factor [6, 13]. This is the case in applications such as object detection in images or web search ranking, which require repeated and fast evaluations of the model. As the sparsity of a non-linear kernel SVM cannot be known a priori, it is crucial to devise efficient methods to impose sparsity in the model during training or to sparsify an existing classifier while preserving as much of its generalization capability as possible.

Recent attempts to achieve this goal include post-processing approaches, which reduce the number of support vectors in a given SVM or change the basis used to express the classifier [3, 12, 15], and direct methods, which modify the SVM objective or introduce heuristics during the optimization to maximize sparsity [2, 6, 7, 11, 14]. In a recent breakthrough, [3] proposed a simple technique to reduce the number of support vectors in a given SVM, showing that it is asymptotically optimal and outperforms many competing approaches in practice. Unfortunately, most of these techniques either depend on several parameter and heuristic choices to yield good performance or demand significant computational resources. In this paper, we argue that these problems can be effectively circumvented by sparsifying an SVM through a simple Lasso problem [4] in the kernel space. Interestingly, this criterion was already mentioned in [12], but was neither accompanied by an efficient algorithm nor systematically assessed in practice. By exploiting recent advances in optimization [4, 5, 9], we devise an algorithm that is significantly cheaper than [3] in terms of optimization and parameter selection, yet competitive in terms of the accuracy/sparsity tradeoff.

2 Problem Statement and Related Work

Given data \(\{(\mathbf {x}_{i}, y_i)\}_{i=1}^m\) with \(\mathbf {x}_{i} \in X\) and \(y_i \in \{\pm 1\}\), SVMs learn a predictor of the form \(f_{\mathbf {w},b}(\mathbf {x}) = \text{ sign }(\mathbf {w}^T\phi (\mathbf {x}) + b)\), where \(\phi (\mathbf {x})\) is a feature vector representation of the input pattern \(\mathbf {x}\) and \(\mathbf {w}\in \mathcal {H},b \in \mathbb {R}\) are the model parameters. To allow more flexible decision boundaries, \(\phi (\mathbf {x})\) often implements a non-linear mapping \(\phi : X \rightarrow \mathcal {H}\) of the input space into a Hilbert space \(\mathcal {H}\), related to X by means of a kernel function \(k:X\times X \rightarrow \mathbb {R}\). The kernel makes it possible to compute dot products in \(\mathcal {H}\) directly from X, using the property \(\phi (\mathbf {x}_i)^{T}\phi (\mathbf {x}_{j}) = k(\mathbf {x}_{i}, \mathbf {x}_{j}), \, \forall \, \mathbf {x}_{i},\mathbf {x}_{j}\in X\). The values of \(\mathbf {w},b\) are determined by solving a problem of the form

$$\begin{aligned} \mathbf {min}_{\mathbf {w},b} \, \, \tfrac{1}{2} \Vert \mathbf {w}\Vert _{\mathcal {H}}^2 + C\sum \nolimits _{i=1}^m \ell \left( y_i(\mathbf {w}^T\phi (\mathbf {x}_i) + b)\right) ^{p} , \end{aligned}$$
(1)

where \(p\in \{1,2\}\) and \(\ell (z)=(1-z)_{+}\) is the hinge loss. It is well known that the solution \(\mathbf {w}^{*}\) of (1) can be written as a linear combination of the training patterns in the feature space \(\mathcal {H}\). This leads to the “kernelized” decision function

$$\begin{aligned} f_{\mathbf {w},b}(\mathbf {x})&= \text{ sign }\left( \mathbf {w}^{*T}\phi (\mathbf {x}) + b^{*}\right) = \text{ sign }\left( \sum \nolimits _{i=1}^{m} y_i \beta _i^{*} k(\mathbf {x}_i,\mathbf {x}) + b^{*}\right) , \end{aligned}$$
(2)

whose run time complexity is determined by the number \({n_{\mathrm {\tiny sv}}}\) of examples such that \(\beta _i^{*} \ne 0\). These examples are called the support vectors (SVs) of the model. In contrast to the linear case (\(\phi (\mathbf {x}) = \mathbf {x}\)), kernel SVMs need to explicitly store and access the SVs to perform predictions. Unfortunately, it is well known that, in general, \({n_{\mathrm {\tiny sv}}}\) grows as a linear function of the number of training points [6, 13] (at the very least, all misclassified points are SVs), and therefore \({n_{\mathrm {\tiny sv}}}\) is often too large in practice, leading to classifiers that are expensive to store and evaluate. Since \({n_{\mathrm {\tiny sv}}}\) is the number of non-zero entries in the coefficient vector \(\varvec{\beta }^{*}\), this problem is often referred to in the literature as the lack of sparsity of non-linear SVMs.
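
To make the role of \({n_{\mathrm {\tiny sv}}}\) in the prediction cost explicit, the following minimal sketch (in Python/NumPy, with an RBF kernel and illustrative variable names not taken from the paper) evaluates the kernel expansion (2); each test point requires one kernel evaluation per support vector:

```python
import numpy as np

def rbf_kernel_value(x, z, gamma=0.1):
    # k(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_predict(x, support_vectors, coeffs, b, gamma=0.1):
    """Evaluate (2): sign(sum_i y_i beta_i k(x_i, x) + b).

    coeffs[i] stores y_i * beta_i for the i-th support vector, so the cost
    of a single prediction is O(n_sv) kernel evaluations.
    """
    score = sum(c * rbf_kernel_value(sv, x, gamma)
                for sv, c in zip(support_vectors, coeffs))
    return np.sign(score + b)
```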

Methods to address this problem can be categorized into two main families: post-processing or reduction methods, which, starting from a non-sparse classifier, find a more efficient predictor preserving as much of the original predictive accuracy as possible, and direct methods, which modify the training criterion (1) or introduce heuristics during its optimization to promote sparsity. The first category includes methods that select a subset of the original support vectors to recompute the classifier [12, 15], techniques that replace the original support vectors with arbitrary points of the input space [10], and methods tailored to a specific class of SVM [8]. The second category includes offline [6, 7, 11] as well as online learning algorithms [2, 14]. Unfortunately, most of these techniques either incur a significant computational cost or depend on several heuristic choices to yield good performance. Recently, a simple yet asymptotically optimal reduction method named issvm was presented in [3], comparing favorably with the state of the art in terms of the accuracy/sparsity tradeoff. The method is based on the observation that the hinge loss of a predictor \(f_{\mathbf {w},b}\) can be approximately preserved using a number of support vectors proportional to \(\Vert \mathbf {w}\Vert _{\ell _2}\), by applying sub-gradient descent to the minimization of the following objective function

(3)

where \(h_i = \text{ min }(1,y_i(\mathbf {w}^T\phi (\mathbf {x}_i)+ b))\). Using this method to sparsify an SVM \(f_{\mathbf {w},b}\) guarantees a reduction of \({n_{\mathrm {\tiny sv}}}\) to at most \(\mathcal {O}(\Vert \mathbf {w}\Vert _{\ell _2})\) support vectors. However, since different levels of sparsification may be required in practice, the algorithm is equipped with an additional projection step. In the course of the optimization, the approximation \(\tilde{\mathbf {w}}\) is projected onto the \(\ell _2\)-ball of radius \(\delta \), where \(\delta \) is a parameter controlling the level of sparsification. Unfortunately, the inclusion of this projection step and the weak convergence properties of sub-gradient descent make the algorithm quite sensitive to parameter tuning.

3 Sparse SVM Approximations via Kernelized Lasso

Suppose we want to sparsify an SVM with parameters \(\mathbf {w}_{*},b_{*}\), kernel \(k(\cdot ,\cdot )\) and support set \(S = \{(\mathbf {x}_{(i)}, y_{(i)})\}_{i=1}^{n_{\mathrm {\tiny sv}}}\). Let \(\phi : X \rightarrow \mathcal {H}\) be the feature map implemented by the kernel and \(\phi (\mathbf {S})\) the matrix whose i-th column is given by \(\phi (\mathbf {x}_{(i)})\). With this notation, \(\mathbf {w}_{*}\) can be written as \(\mathbf {w}_{*} =\phi (\mathbf {S})\varvec{\alpha }_{*}\) with \(\varvec{\alpha }_{*} \in \mathbb {R}^{{n_{\mathrm {\tiny sv}}}}\). In this paper, we look for approximations of the form \(\mathbf {u}=\phi (\mathbf {S})\varvec{\alpha }\) with sparse \(\varvec{\alpha }\). Support vectors such that \(\alpha _{(i)}=0\) are pruned from the approximation.
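
For concreteness, the quantities introduced above can be read off any trained kernel SVM. The sketch below gathers the support set \(\mathbf {S}\), the coefficients \(\varvec{\alpha }_{*}\) (with the labels already folded in) and the kernel matrix \(\mathbf {K}\) restricted to the support set; it assumes a scikit-learn SVC with an RBF kernel purely for illustration, not the SMO/C++ setup used in our experiments:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def extract_svm_model(X_train, y_train, C=1.0, gamma=0.1):
    """Train a (dense) kernel SVM and collect the pieces needed for sparsification."""
    svc = SVC(C=C, gamma=gamma, kernel="rbf").fit(X_train, y_train)
    S = svc.support_vectors_             # support set, one row per support vector
    alpha_star = svc.dual_coef_.ravel()  # y_(i) * alpha_(i) for each support vector
    b_star = svc.intercept_[0]           # original bias b_*
    K = rbf_kernel(S, S, gamma=gamma)    # kernel matrix on the support set
    return S, alpha_star, b_star, K
```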

Our approximation criterion is based on two observations. The first is that the objective function (3) can be bounded by a differentiable function that is more convenient for optimization. Importantly, this function also bounds the expected loss of accuracy incurred by the approximation. Indeed, the following result (whose proof we omit due to space constraints) holds:

Proposition 1

Consider an SVM implementing the decision function \(f_{\mathbf {w},b}(\mathbf {x}) = \text{ sign }(\mathbf {w}^T\phi (\mathbf {x}) + b)\) and an alternative decision function \(f_{\mathbf {u},b}(\mathbf {x})= \text{ sign }(\mathbf {u}^T\phi (\mathbf {x}) + b)\), with \(\mathbf {u}\in \mathcal {H}\). Let \(\ell (z)\) be the hinge loss. Then, \(\exists M>0\) such that

The result above suggests that we can substitute \(\mathbf {w}\in \mathcal {H}\) in the original SVM by some \(\mathbf {u}\in \mathcal {H}\) such that \(\Vert \mathbf {u}- \mathbf {w}_{*}\Vert ^2\) is small. However, a surrogate obtained in this way need not be sparse. Indeed, minimizing \(\Vert \mathbf {u}- \mathbf {w}_{*}\Vert ^2\) in \(\mathcal {H}\) trivially yields the original predictor \({\mathbf {w}_{*}}\), which is generally dense. We thus need to restrict the search to a family of sparser models. Our second observation is that a well-known, computationally attractive and principled way to induce sparsity is \(\ell _1\)-norm regularization, i.e., constraining the coefficient vector to lie in a ball around 0 with respect to the norm \(\Vert \varvec{\alpha }\Vert _{\ell _1} = \sum \nolimits _i |\alpha _i|\). Thus, we approach the task of sparsifying the SVM by solving a problem of the form

$$\begin{aligned} \underset{\varvec{\alpha }\in \mathbb {R}^{n_{\mathrm {\tiny sv}}}}{\text{ min }} \tfrac{1}{2} \Vert \phi (\mathbf {S})\varvec{\alpha }- \mathbf {w}_{*}\Vert ^2 \ \text{ s.t. } \ \Vert \varvec{\alpha }\Vert _{\ell _1} \le \delta , \end{aligned}$$
(4)

where \(\delta \) is a regularization parameter controlling the level of sparsification. The obtained problem can be easily recognized as a kernelized Lasso with response variable \(\mathbf {w}_{*}\) and design matrix \(\phi (\mathbf {S})\). By observing that

$$\begin{aligned} \begin{aligned} \Vert \mathbf {w}_{*} - \phi (\mathbf {S})\varvec{\alpha }\Vert ^2&= \mathbf {w}_{*}^T\mathbf {w}_{*} \!- 2 \varvec{\alpha }_{*}^T\phi (\mathbf {S})^T\phi (\mathbf {S})\varvec{\alpha }+ \varvec{\alpha }^T\phi (\mathbf {S})^T\phi (\mathbf {S})\varvec{\alpha }\\&= \mathbf {w}_{*}^T\mathbf {w}_{*} - 2 \varvec{\alpha }_{*}^T\mathbf {K}\varvec{\alpha }+ \varvec{\alpha }^T\mathbf {K}\varvec{\alpha }= \mathbf {w}_{*}^T\mathbf {w}_{*} - 2 \mathbf {c}^T\varvec{\alpha }+ \varvec{\alpha }^T\mathbf {K}\varvec{\alpha }, \end{aligned} \end{aligned}$$
(5)

where \(\mathbf {c}=\mathbf {K}\varvec{\alpha }_{*}\), it is easy to see that solving (4) only requires access to the kernel matrix (or the kernel function):

$$\begin{aligned} \underset{\varvec{\alpha }\in \mathbb {R}^{{n_{\mathrm {\tiny sv}}}}}{\text{ min }} \ g(\varvec{\alpha }) = \tfrac{1}{2} \varvec{\alpha }^T\mathbf {K}\varvec{\alpha }- \mathbf {c}^T\varvec{\alpha }\ \text{ s.t. } \ \Vert \varvec{\alpha }\Vert _{\ell _1} \le \delta . \end{aligned}$$
(6)

This type of approach was considered, up to some minor differences, by Schölkopf et al. in [12]. However, to the best of our knowledge, it has been largely left out of the recent literature on sparse approximation of kernel models. One possible reason is that the original proposal had a high computational cost, making it unattractive for large models. We reconsider this technique, arguing that recent advances in Lasso optimization make it possible to solve the problem efficiently using high-performance algorithms with strong theoretical guarantees [4]. Importantly, we show in Sect. 4 that this efficiency is not obtained at the expense of accuracy; indeed, the method can match or even surpass the performance of the current state-of-the-art methods.
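
As a concrete illustration of (5)–(6), the two helper functions below (reusing the kernel matrix \(\mathbf {K}\) and the coefficients \(\varvec{\alpha }_{*}\) gathered earlier; names are illustrative) evaluate the approximation error and the Lasso objective using kernel values only:

```python
import numpy as np

def approximation_error_sq(alpha, alpha_star, K):
    """||phi(S) alpha - w_*||^2 computed purely from the kernel matrix, cf. (5)."""
    c = K @ alpha_star                                  # c = K alpha_*
    return alpha_star @ c - 2.0 * (c @ alpha) + alpha @ K @ alpha

def lasso_objective(alpha, K, c):
    """g(alpha) = 1/2 alpha^T K alpha - c^T alpha, the objective of (6).

    It differs from 1/2 ||phi(S) alpha - w_*||^2 only by the constant
    1/2 ||w_*||^2 = 1/2 alpha_*^T K alpha_*, so no explicit feature map is needed.
    """
    return 0.5 * (alpha @ K @ alpha) - c @ alpha
```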

Algorithm. To solve the kernelized Lasso problem, we adopt a variant of the Frank-Wolfe (FW) method [5], an iterative greedy algorithm to minimize a convex differentiable function \(g(\varvec{\alpha })\) over a closed convex set \(\varSigma \), specially tailored to handle large-scale instances of (6). This method does not require computing the matrix \(\mathbf {K}\) beforehand, is very efficient in practice and enjoys important convergence guarantees [5, 9], some of which are summarized in Theorem 1. Given an iterate \({\varvec{\alpha }}^{(k)}\), a step of FW consists in finding a descent direction as

$$\begin{aligned} \mathbf {u}^{(k)} \in \mathop {{{\mathrm{argmin}}}}\limits _{\mathbf {u}\,\in \, \varSigma } \, (\mathbf {u}-\varvec{\alpha }^{(k)})^T \nabla g({\varvec{\alpha }}^{(k)}), \end{aligned}$$
(7)

and updating the current iterate as \({\varvec{\alpha }}^{(k+1)}= (1-\lambda ^{(k)}){\varvec{\alpha }}^{(k)} + \lambda ^{(k)} {\mathbf {u}}^{(k)}\). The step-size \(\lambda ^{(k)}\) can be determined by an exact line-search (which can be done analytically for quadratic objectives) or by setting it to \(\lambda ^{(k)} = 1/(k+2)\) as in [5].
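
For the quadratic objective in (6), the exact line-search admits a closed form: writing \(\mathbf {d}^{(k)} = \mathbf {u}^{(k)} - {\varvec{\alpha }}^{(k)}\), setting the derivative of \(g({\varvec{\alpha }}^{(k)} + \lambda \mathbf {d}^{(k)})\) with respect to \(\lambda \) to zero and clipping the result to [0, 1] gives

$$\begin{aligned} \lambda ^{(k)} = \min \left( 1, \, \frac{- (\mathbf {d}^{(k)})^T \nabla g({\varvec{\alpha }}^{(k)})}{(\mathbf {d}^{(k)})^T \mathbf {K}\, \mathbf {d}^{(k)}} \right) , \end{aligned}$$

provided \((\mathbf {d}^{(k)})^T \mathbf {K}\, \mathbf {d}^{(k)} > 0\); otherwise the predefined rule \(\lambda ^{(k)} = 1/(k+2)\) can be used.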

In the case of problem (6), where \(\varSigma \) corresponds to the \(\ell _1\)-ball of radius \(\delta \) in \(\mathbb {R}^{n_{\mathrm {\tiny sv}}}\) (with vertices \(\mathcal {V}=\{\pm \delta \mathbf {e}_i: i= 1,2,\ldots ,{n_{\mathrm {\tiny sv}}}\}\)) and the gradient is \(\nabla g(\varvec{\alpha }) = \mathbf {K}\varvec{\alpha }- \mathbf {c}\), it is easy to see that the solution of (7) is determined by the index

$$\begin{aligned} j^{*}&=\arg \max _{j \in [{n_{\mathrm {\tiny sv}}}]} \left| \phi (\mathbf {s}_j)^T\phi (\mathbf {S})\varvec{\alpha }- c_j \right| = \arg \max _{j \in [{n_{\mathrm {\tiny sv}}}]} \left| \sum \nolimits _{i: \alpha _i \ne 0} \alpha _i K_{ij} - c_j \right| . \end{aligned}$$
(8)

The adaptation of the FW algorithm to problem (6) is summarized in Algorithm 1 and is referred to as SASSO in the rest of this paper.

Algorithm 1. SASSO: the Frank-Wolfe scheme applied to problem (6).
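
As an illustration of how Algorithm 1 operates, the following is a minimal Python/NumPy sketch of the Frank-Wolfe iteration for (6), combining the selection rule (8) with the exact line-search above; variable names, the fixed iteration budget and the fallback step size are illustrative choices and not details of the released C++ implementation:

```python
import numpy as np

def sasso_frank_wolfe(K, alpha_star, delta, n_iters=1000):
    """Frank-Wolfe sketch for (6): min_a 1/2 a^T K a - c^T a  s.t. ||a||_1 <= delta.

    K is the kernel matrix on the support set and alpha_star the coefficients
    of the input SVM, so that c = K alpha_* as in (5).
    """
    n_sv = K.shape[0]
    c = K @ alpha_star
    alpha = np.zeros(n_sv)              # the origin is feasible for any delta > 0
    grad = -c                           # gradient K alpha - c at alpha = 0
    for k in range(n_iters):
        # Vertex selection (8): coordinate with the largest absolute gradient entry.
        j = int(np.argmax(np.abs(grad)))
        u = np.zeros(n_sv)
        u[j] = -delta * np.sign(grad[j])            # vertex +/- delta e_j of the l1-ball
        d = u - alpha
        denom = d @ K @ d
        if denom > 1e-12:                           # exact line-search, clipped to [0, 1]
            lam = min(1.0, max(0.0, -(d @ grad) / denom))
        else:
            lam = 1.0 / (k + 2)                     # predefined fallback step size
        alpha = alpha + lam * d
        grad = K @ alpha - c                        # refresh the gradient
    return alpha                                    # zero entries correspond to pruned SVs
```

In a careful implementation the gradient can be updated incrementally, since each iterate changes along a single vertex and thus requires only one column of \(\mathbf {K}\); this is what makes precomputing the full kernel matrix unnecessary.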

Theorem 1

Consider problem (6) with \(\delta \in (0, \Vert \varvec{\alpha }_{*}\Vert _{\ell _1})\) and let \(\bar{\varvec{\alpha }}\) denote a solution of (6). Algorithm 1 is monotone and globally convergent. In addition, there exists \(C>0\) such that

$$\begin{aligned} \Vert \mathbf {w}_{*} - \phi (\mathbf {S}){\varvec{\alpha }}^{(k)}\Vert ^2 - \Vert \mathbf {w}_{*} - \phi (\mathbf {S})\bar{\varvec{\alpha }}\Vert ^2 \le {C}/{(k+2)}. \end{aligned}$$
(9)

Tuning of b. We have assumed above that the bias b of the SVM can be preserved in the approximation. A slight boost in accuracy can be obtained by computing a value of b which accounts for the change in the composition of the support set. For the sake of simplicity, we adopt here a method based on a validation set, i.e., we define a range of possible values for b and then choose the value minimizing the misclassification loss on that set. It can be shown that it is safe (in terms of accuracy) to restrict the search to a suitably bounded interval.
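
A possible implementation of this validation-based tuning of b is sketched below; the candidate grid is an illustrative choice, since the precise interval used in our experiments is not reproduced here:

```python
import numpy as np

def tune_bias(scores_val, y_val, b_grid):
    """Pick the bias minimizing the misclassification loss on a validation set.

    scores_val[i] holds the kernel expansion sum_j alpha_j k(x_j, x_i) of the
    sparsified model on the i-th validation point, without the bias term.
    """
    errors = [np.mean(np.sign(scores_val + b) != y_val) for b in b_grid]
    return b_grid[int(np.argmin(errors))]

# Hypothetical usage with a grid centered at the original bias b_star:
# b_grid = np.linspace(b_star - 1.0, b_star + 1.0, 41)
# b_new = tune_bias(K_val.T @ alpha_sparse, y_val, b_grid)
```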

Fig. 1. Test accuracy (y axis) versus sparsity (number of support vectors, x axis). From top-left to bottom-right: Adult, IJCNN, TIMIT and MNIST datasets.

4 Experimental Results

We present experiments on four datasets recently used in [2, 3] to assess SVM sparsification methods: Adult (a8a), IJCNN, TIMIT and MNIST. Table 1 summarizes the number of training points m and test points t for each dataset. The SVMs to sparsify were trained using SMO with an RBF kernel and parameters set as in [2, 3]. As discussed in Sect. 2, we compare the performance of our algorithm with that of the ISSVM algorithm, which has a publicly available C++ implementation [3]. Our algorithms have also been implemented in C++. We executed the experiments on a 2 GHz Intel Xeon E5405 CPU with 20 GB of main memory running CentOS, without exploiting multithreading or parallelism in computations. The code, the data and instructions to reproduce the experiments of this paper are publicly available at https://github.com/maliq/FW-SASSO.

We test two versions of our method: the standard one in Algorithm 1, and an aggressive variant employing a fully corrective FW solver (where an internal optimization over the current active set is carried out at each iteration, see e.g. [5]). The baseline also comes in two versions. The “basic” version has two parameters, namely the \(\ell _2\)-norm parameter and \(\eta \): the first controls the level of sparsity, and the second is the learning rate used for training. The “aggressive” version has an additional tolerance parameter \(\epsilon \) (see [3] for details). To choose values for these parameters, we reproduced the methodology employed in [3], i.e., for the learning rate we tried the values \(\eta =4^{-4},\ldots ,4^{2}\) and for \(\epsilon \) (in the aggressive variant) we tried the values \(\epsilon =2^{-4},\ldots ,1\). For each level of sparsity, we chose the parameter values based on a validation set. This procedure was repeated over 10 test/validation splits.

Table 1. Time required to build the sparsity/accuracy path. We report the total time spent in parameter selection (training with different parameter values and evaluating them on the validation set) and the average training time needed to build a single path.

Following previous work [2, 3], we assess the algorithms on the entire sparsity/accuracy path, i.e., we produce solutions with decreasing levels of sparsity (increasing numbers of support vectors) and evaluate their performance on the test set. For ISSVM, this is achieved using different values of the \(\ell _2\)-norm parameter. During the execution of these experiments, we observed that it is quite difficult to determine an appropriate range of values for this parameter. Our criterion was to set this range manually until obtaining the range of sparsities reported in the figures of [3]. For SASSO, the level of sparsity is controlled by the parameter \(\delta \) in (4). The maximum value for \(\delta \) is easily determined as the \(\ell _1\)-norm of the input SVM and the minimum value as \(10^{-4}\) times the former. To make the comparison fair, we compute 10 points of the path for all the methods.
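
The construction of the SASSO path described above can be summarized by the following sketch, where solve_lasso stands for any solver of (6), e.g. the Frank-Wolfe sketch of Sect. 3, and K_test holds the kernel values between the support set and the test points:

```python
import numpy as np

def sparsity_accuracy_path(solve_lasso, K, K_test, alpha_star, y_test, b, n_points=10):
    """Sweep delta from 1e-4 * ||alpha_*||_1 up to ||alpha_*||_1 and record
    (number of support vectors, test accuracy) pairs."""
    delta_max = np.abs(alpha_star).sum()
    deltas = np.logspace(np.log10(1e-4 * delta_max), np.log10(delta_max), n_points)
    path = []
    for delta in deltas:
        alpha = solve_lasso(K, alpha_star, delta)
        n_sv = int(np.count_nonzero(alpha))
        preds = np.sign(K_test.T @ alpha + b)       # K_test has shape (n_sv, n_test)
        path.append((n_sv, float(np.mean(preds == y_test))))
    return path
```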

Results in Fig. 1 show that the sparsity/accuracy tradeoff path obtained by SASSO matches that of the (theoretically optimal) ISSVM method [3], and often outperforms it on the sparsest section of the path. At the same time, as can be seen from Table 1, our method enjoys a considerable computational advantage over ISSVM: on average, it is faster by 1–2 orders of magnitude, and the overhead due to parameter selection is marginal, whereas for ISSVM the total time is one order of magnitude larger than the single-model training time. We also note that the aggressive variant of SASSO enjoys a small but consistent advantage on all the considered datasets. Both versions of our method exhibit a very stable and predictable performance, while ISSVM needs the more aggressive variant of the algorithm to produce a regular path. However, this variant requires considerable parameter tuning to achieve a behavior similar to that observed for SASSO, which translates into a considerably longer running time.

5 Conclusions

We presented an efficient method to compute sparse approximations of non-linear SVMs, i.e., to reduce the number of support vectors in the model. The algorithm enjoys strong convergence guarantees and is easy to implement in practice. Further algorithmic improvements could be obtained by implementing the stochastic acceleration studied in [4]. Our experiments showed that the proposed method is competitive with the state of the art in terms of accuracy, with a small but systematic advantage when sparser models are required. In computational terms, our approach is significantly more efficient due to the properties of the optimization algorithm and the avoidance of cumbersome parameter tuning.