
1 Introduction

Kernel-based learning methods are very popular in various machine learning tasks like regression, classification or clustering [1–5]. The operations used to calculate the respective models typically evaluate the full kernel matrix, leading to quadratic or even cubic complexity. As a consequence, the approximation of positive semi-definite (psd) kernels has raised wide interest [6, 7]. Most approaches focus on approximating the kernel by the (clustered) Nyström approximation or specific variations of the singular value decomposition [8, 9]. A recent approach effectively combining multiple strategies was presented in [6]. Random Fourier features (RFF) were introduced to the field of kernel-based learning in [10]. The aim is to approximate shift-invariant kernels by mapping the input data into a randomized feature space and then applying existing fast linear methods [11–13]. This is of special interest if the number of samples N is very large and the resulting kernel matrix \(K \in \mathbb {R}^{N \times N}\) leads to high storage and calculation costs. The features are constructed so that the inner products of the transformed data are approximately equal to those in the feature space of a user-specified shift-invariant kernel:

$$\begin{aligned} k(\mathbf {x},\mathbf {y}) = \langle \phi (\mathbf {x}), \phi (\mathbf {y}) \rangle \approx z(\mathbf {x})^\prime z(\mathbf {y}) \end{aligned}$$
(1)

Here \(\phi : \mathcal {X} \mapsto \mathcal {H}\) is a non-linear mapping of patterns from the original input space \(\mathcal {X}\) to a high-dimensional space \(\mathcal {H}\). The mapping function is in general not given in an explicit form. Unlike the kernel lifting by \(\phi (\cdot )\), z is a comparably low-dimensional feature vector. In [14] it was empirically shown that for data with a large eigenvalue gap random Fourier features are less efficient than a standard Nyström approximation. However, the authors used only a rather small, data-independent set of Fourier features. Here, we propose a selection strategy which not only reduces the number of necessary random Fourier features but also helps to select a reasonable set of features that provides a good approximation of the original kernel function. Accordingly, our focus is less on the best possible approximation accuracy and more on saving memory and obtaining compact representations, to address the use of random Fourier features in low-resource environments. In [10] it is shown how random Fourier feature vectors z can be constructed for various shift-invariant kernels \(k(\mathbf {x}-\mathbf {y})\), e.g. the RBF kernel, up to an error \(\epsilon \) using only \(D = \mathcal {O}(d \epsilon ^{-2} \log \frac{1}{\epsilon ^2} )\) dimensions, where d is the input dimension of the original data. Assuming that \(d = 10\) and \(\epsilon = 0.01\) one gets \( D \approx \) 100,000.

However, in [10] it is empirically shown that the approximation is already reasonable for smaller \(D \approx \) 500–5000. While very efficient in general, there is not yet a systematic strategy for choosing an appropriate number D or for deciding which features should be generated. If we assume that the images of the training inputs in the feature space given (implicitly) by the kernel lie in an intrinsically low-dimensional space, one can expect that a much smaller number of features should be sufficient to describe the data. A reasonable strategy to test for a reliable number D of random Fourier features is to compare the approximated kernel using Eq. (1) with the true kernel K based on the original data, using some appropriate measure. This, however, is in general not possible or very costly for larger N, because one would need to generate two \(N \times N\) kernel matrices. We suggest a constructive approach, generating as many features as necessary to obtain a low reconstruction error between the two kernel matrices. Our approach is very generic: we do not focus on dedicated cost functions used in (semi-)supervised classification, clustering or embedding, but require only that the constructed feature set provides a reasonable approximation of the shift-invariant kernel matrix. Standard feature reduction techniques for high-dimensional data sets like random projection [15, 16], unsupervised feature selection techniques based on statistical measures [17] or supervised approaches [18, 19] are not suitable because they start from a high-dimensional feature space or are specific to the underlying cost function. In [1] random Fourier features were used in combination with a singular value decomposition to reduce the number of features after generating them, again with an in general rather large initial D. To avoid high costs in the construction procedure we employ the Nyström approximation at different points to evaluate the accuracy of the constructed random Fourier feature set using the Frobenius norm. We assume that the considered kernel is in fact intrinsically low dimensional. The paper is organized as follows: first we review the main theory of random Fourier features; subsequently we detail our approach of finding small sets of random Fourier features that are still sufficiently accurate to approximate the kernel matrix with respect to the Frobenius norm. In a subsection we review the Nyström approximation and derive a linear-time calculation of the Frobenius norm of the difference of two Nyström approximated matrices. Later we derive error bounds for the presented approach and show the efficiency of our method on various standard datasets, employing two state-of-the-art linear-time classifiers for vectorial input data.

2 Random Fourier Features

Random Fourier features, as introduced in [10], project the vectorial data points onto a randomly chosen line and then pass the resulting scalar through a sinusoid. The random lines are drawn from a distribution chosen so as to guarantee that the inner product of two transformed points approximates the desired shift-invariant kernel. The motivation for this approach is given by Bochner's theorem:

Definition 1

A continuous kernel \(k(\mathbf {x},\mathbf {y}) = k(\mathbf {x}-\mathbf {y})\) on \(\mathbb {R}^d\) is positive definite if and only if \(k(\mathbf {x}-\mathbf {y})\) is the Fourier transform of a non-negative measure.


If the kernel \(k(\mathbf {x}-\mathbf {y})\) is properly scaled, Bochner's theorem guarantees that its Fourier transform \(p(\omega )\) is a proper probability distribution. The idea in [10] is to approximate the kernel as

$$ k(\mathbf {x}-\mathbf {y}) = \int _{\mathbb {R}^d} p(\omega ) e^{j \omega ^\top (\mathbf {x}-\mathbf {y})} d \omega $$

With some additional normalizations and simplifications one can sample the features for k using the mapping \(z_\omega (\mathbf {x}) = [\cos (\omega ^\top \mathbf {x})\; \sin (\omega ^\top \mathbf {x})]\). In [10] the authors also give a proof for the uniform convergence of Fourier features to the kernel \(k(\mathbf {x}-\mathbf {y})\); a detailed derivation can be found there.
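The underlying reasoning step, spelled out here for completeness and assuming a real-valued kernel (so that the complex exponential can be replaced by a cosine), is that the expectation of the inner product of these features recovers the kernel:

$$\begin{aligned} E_{\omega \sim p}\big [ z_\omega (\mathbf {x})^\top z_\omega (\mathbf {y}) \big ]&= E_{\omega }\big [ \cos (\omega ^\top \mathbf {x})\cos (\omega ^\top \mathbf {y}) + \sin (\omega ^\top \mathbf {x})\sin (\omega ^\top \mathbf {y}) \big ] \\&= E_{\omega }\big [ \cos (\omega ^\top (\mathbf {x}-\mathbf {y})) \big ] = k(\mathbf {x}-\mathbf {y}), \end{aligned}$$

so averaging over D independent samples \(\omega _1,\ldots ,\omega _D\) concentrates around the true kernel value.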

To generate the random Fourier features one needs a shift-invariant psd kernel \(k(\mathbf {x},\mathbf { y}) = k(\mathbf {x}-\mathbf {y})\) and a random feature map \(z(\mathbf {x}): \mathbb {R}^d\rightarrow \mathbb {R}^{2D}\) s.t. \(z(\mathbf {x})^\top z(\mathbf {y}) \approx k(\mathbf {x}-\mathbf {y})\). One draws D i.i.d. samples \(\omega _1,\ldots ,\omega _D \in \mathbb {R}^d\) from \(p(\omega )\) and generates \(z(\mathbf {x}) = \sqrt{1/D}[\cos (\omega _1^\top \mathbf {x}) \ldots \cos (\omega _D^\top \mathbf {x})\; \sin (\omega _1^\top \mathbf {x}) \ldots \sin (\omega _D^\top \mathbf {x})]^\top \).
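As an illustration, the following minimal NumPy sketch (not part of the original paper; function names, the value of \(\sigma \) and D, and the toy data are our own choices) draws frequencies for the RBF kernel \(k(\mathbf {x}-\mathbf {y}) = \exp (-\Vert \mathbf {x}-\mathbf {y}\Vert ^2/(2\sigma ^2))\), whose spectral density \(p(\omega )\) is the Gaussian \(N(0, \sigma ^{-2} I_d)\), and builds the feature map z:

import numpy as np

def sample_rbf_frequencies(d, D, sigma, rng):
    # For the RBF kernel the spectral density p(omega) is N(0, sigma^{-2} I_d).
    return rng.normal(scale=1.0 / sigma, size=(D, d))

def rff_map(X, Omega):
    # Map N x d data to N x 2D random Fourier features, z(x)^T z(y) ~ k(x - y).
    proj = X @ Omega.T                      # N x D projections omega_l^T x
    D = Omega.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))              # toy data, d = 10
Omega = sample_rbf_frequencies(d=10, D=500, sigma=2.0, rng=rng)
Z = rff_map(X, Omega)
K_true = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / (2 * 2.0 ** 2))
print(np.abs(Z @ Z.T - K_true).max())       # maximum entrywise error (shrinks as D grows)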

3 Finding Small Sets of Random Fourier Features

To incrementally add Fourier features to the approximation of the kernel k we use the Frobenius norm to measure the difference between the two kernel matrices. For real-valued data the Frobenius norm of the difference of two square matrices is simply the square root of the sum of squared differences between the individual kernel entries:

$$\begin{aligned} \left\| {\hat{k} - k} \right\| _{F} = \sqrt{\sum _{i=1}^N \sum _{j=1}^N (\hat{k}(\mathbf {x}_i - \mathbf {x}_j) - k(\mathbf {x}_i -\mathbf {x}_j))^2} \end{aligned}$$
(2)

This has \(\mathcal {O}(N^2)\) costs in memory and runtime, and we would need to generate the full kernel matrices \(\hat{k}\) and k. To avoid these costs we use the Nyström approximation for kernel matrices [20] to approximate both kernels using only \(\mathcal {O}(N)\) coefficients, and we provide a formulation for calculating the Frobenius norm of the difference of two Nyström approximated matrices. The Nyström approximation NyK of the original kernel matrix (detailed in the next subsection) can be computed once, prior to the construction of the random Fourier features. Subsequently the approximated kernel is constructed by iteratively adding those n random Fourier features which improve the Frobenius error in Eq. (2) by more than \( \epsilon = 10^{-3}\). This iterative procedure is continued until either no further significant improvement is found for a number iterMax = 5 of consecutive random selections or an upper limit of features \(D_{max}\) is reached. The detailed procedure is given in Algorithm 1.
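Since the pseudocode of Algorithm 1 is not reproduced here, the following simplified sketch illustrates the incremental selection loop as we read it; the batch size, threshold and all function names are illustrative choices of ours, and for clarity the Frobenius error is evaluated exactly on a toy-sized kernel matrix, whereas the paper replaces this step by the linear-time Nyström-based estimate of Sect. 4:

import numpy as np

def select_rff(X, kernel_fn, sigma, D_max=500, batch=10, eps=1e-3, iter_max=5, seed=0):
    # Greedy construction of a small RFF set (simplified sketch of Algorithm 1).
    rng = np.random.default_rng(seed)
    N, d = X.shape
    K = kernel_fn(X)                        # reference kernel matrix (toy-sized here)
    Omega = np.empty((0, d))                # selected frequencies
    best_err = np.linalg.norm(K, 'fro')     # error of the empty feature set
    fails = 0
    while Omega.shape[0] < D_max and fails < iter_max:
        cand = rng.normal(scale=1.0 / sigma, size=(batch, d))  # candidate frequencies
        Omega_new = np.vstack([Omega, cand])
        proj = X @ Omega_new.T
        Z = np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(Omega_new.shape[0])
        err = np.linalg.norm(Z @ Z.T - K, 'fro')
        if best_err - err > eps:            # keep candidates only if they clearly help
            Omega, best_err, fails = Omega_new, err, 0
        else:
            fails += 1
    return Omega

sigma = 2.0
X = np.random.default_rng(1).normal(size=(300, 5))
rbf = lambda X: np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / (2 * sigma ** 2))
Omega = select_rff(X, rbf, sigma)
print(Omega.shape[0], "frequencies selected")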

4 Nyström Approximated Matrix Processing

The Nyström approximation technique was proposed in the context of kernel methods in [20]. One well known way to approximate an \(N \times N\) Gram matrix is to use a low-rank approximation. This can be done by computing the eigendecomposition of the kernel matrix \( {K} = {U} {\varLambda } {U}^T, \) where U is a matrix whose columns are orthonormal eigenvectors and \({\varLambda }\) is a diagonal matrix of eigenvalues \({\varLambda }_{11} \ge {\varLambda }_{22} \ge \ldots \ge 0\), and keeping only the m eigenspaces which correspond to the m largest eigenvalues of the matrix. The approximation is \( {\tilde{K}} = {U}_{(N,m)} {\varLambda }_{(m,m)} {U}_{(m,N)} \approx K, \) where the indices refer to the size of the corresponding submatrix restricted to the largest m eigenvalues. The Nyström method approximates the kernel in a similar way, without computing the eigendecomposition of the whole matrix, which is an \(\mathcal {O}(N^3)\) operation.
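As a point of reference, a minimal NumPy sketch of this truncated eigendecomposition (illustrative only; for large N the paper avoids forming the full decomposition):

import numpy as np

def truncated_eig_approx(K, m):
    # Rank-m approximation via the m largest eigenpairs of a symmetric psd matrix.
    vals, vecs = np.linalg.eigh(K)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:m]         # indices of the m largest eigenvalues
    U_m, L_m = vecs[:, idx], np.diag(vals[idx])
    return U_m @ L_m @ U_m.T
    # np.linalg.norm(K - truncated_eig_approx(K, m), 'fro') is minimal among rank-m approximations.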

By Mercer's theorem, kernels \(k(\mathbf {x},\mathbf {x^\prime })\) can be expanded in terms of orthonormal eigenfunctions \(\varphi _i\) and non-negative eigenvalues \(\lambda _i\) in the form

$$ k(\mathbf {x},\mathbf {x^\prime })=\sum _{i=1}^\infty \lambda _i \varphi _i (\mathbf {x}) \varphi _i (\mathbf {x^\prime }). $$

The eigenfunctions and eigenvalues of a kernel are defined as solutions of the integral equation

$$ \int k(\mathbf {x^\prime },\mathbf {x}) \varphi _i (\mathbf {x}) p (\mathbf {x}) d\mathbf {x} = \lambda _i \varphi _i (\mathbf {x^\prime }), $$

where \(p(\mathbf {x})\) is a probability density over the input space. This integral can be approximated based on the Nyström technique by an i.i.d. sample \(\{\mathbf {x}_k\}_{k=1}^m\) from \(p(\mathbf {x})\):

$$\begin{aligned} \frac{1}{m} \sum _{k=1}^m k(\mathbf {x^\prime },\mathbf {x}_k) \varphi _i (\mathbf {x}_k) \approx \lambda _i \varphi _i (\mathbf {x^\prime }). \end{aligned}$$
(3)

Using this approximation we denote with \({K}^{(m)}\) the corresponding \(m \times m\) Gram sub-matrix and get the corresponding matrix eigenproblem equation as:

$$ \frac{1}{m} {K}^{(m)} {U}^{(m)} = {U}^{(m)} {\varLambda }^{(m)} $$

where \( {U}^{(m)} \in \mathbb {R}^{m \times m}\) is column orthonormal and \( {\varLambda }^{(m)}\) is a diagonal matrix.

Now we can derive the approximations for the eigenfunctions and eigenvalues of the kernel k:

$$\begin{aligned} \lambda _i \approx \frac{\lambda _i^{(m)} \cdot N}{m}, \quad \varphi _i (\mathbf {x^\prime }) \approx \frac{\sqrt{m/N}}{ \lambda _i^{(m)}} \mathbf {k}_x^{\prime ,\top } \mathbf {u}_i^{(m)}, \end{aligned}$$
(4)

where \(\mathbf {u}_i^{(m)}\) is the ith column of \({U}^{(m)}\). Thus, we can approximate \(\varphi _i\) at an arbitrary point \(\mathbf {x^\prime }\) as long as we know the vector \( \mathbf {k}_x^\prime = (k(\mathbf {x}_1,\mathbf {x^\prime }), \ldots , k(\mathbf {x}_m,\mathbf {x^\prime })). \) For a given \(N \times N\) Gram matrix K one may randomly choose m rows and the respective columns. The corresponding indices are called landmarks and should be chosen such that the data distribution is sufficiently covered. Strategies for how to choose the landmarks have recently been addressed in [8, 21] and [22, 23]. We denote these rows by \({K}_{(m,N)}\). Using the formulas in Eq. (4) we can reconstruct the original kernel matrix,

$$ \tilde{{K}} = \sum _{i=1}^m 1/\lambda _i^{(m)}\cdot {K}_{(m,N)}^T (\mathbf {u}_i^{(m)})^T ( \mathbf {u}_i^{(m)} ) {K}_{(m,N)}, $$

where \(\lambda _i^{(m)}\) and \(\mathbf {u}_i^{(m)}\) correspond to the \(m \times m\) eigenproblem (3). Thus we get the approximation,

$$\begin{aligned} \tilde{{K}}={K}_{(N,m)} {K}^{-}_{(m,m)} {K}_{(m,N)}. \end{aligned}$$
(5)

This approximation is exact if \({K}_{(m,m)}\) has the same rank as K.
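A compact NumPy sketch of Eq. (5), purely illustrative: landmarks are drawn uniformly at random here, a pseudo-inverse replaces \(K^{-}_{(m,m)}\) for numerical robustness, and all function names are our own:

import numpy as np

def nystroem_factors(X, kernel_fn, m, rng):
    # Returns K_(N,m) and the pseudo-inverse of K_(m,m); their product
    # K_(N,m) W K_(N,m)^T reproduces Eq. (5) without ever forming the full K.
    idx = rng.choice(X.shape[0], size=m, replace=False)
    K_Nm = kernel_fn(X, X[idx])              # N x m kernel columns at the landmarks
    W = np.linalg.pinv(K_Nm[idx, :])         # K_(m,m)^- (pseudo-inverse of the landmark block)
    return K_Nm, W

def rbf_kernel(A, B, sigma=2.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
K_Nm, W = nystroem_factors(X, rbf_kernel, m=50, rng=rng)
K_tilde = K_Nm @ W @ K_Nm.T                  # Nystroem reconstruction, Eq. (5)

Only the \(N \times m\) matrix and the \(m \times m\) inverse need to be stored; the full reconstruction K_tilde is formed here only to illustrate Eq. (5).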

Nyström Approximation Based Frobenius Norm. Instead of the Frobenius norm definition given in Eq. (2) we use an equivalent formulation based on the trace of the squared difference matrix:

$$\begin{aligned} \left\| {\hat{k} - k} \right\| _{F} = \sqrt{\sum _{i=1}^N \ddot{k}(\mathbf {x}_i,\mathbf {y}_i)} \end{aligned}$$
(6)

\(\ddot{k}(\mathbf {x}_i,\mathbf {y}_j)\) is given by the (i, j)-th entry of the matrix \(\ddot{K}\) defined as \(\ddot{K} = (\hat{K} - K) \cdot (\hat{K}- K)^\top \) (in matrix notation). This formulation is useful because the diagonal elements of a Nyström approximated matrix can be obtained very easily.
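For completeness, the equivalence of Eqs. (2) and (6) is the standard trace identity for the Frobenius norm:

$$\begin{aligned} \left\| \hat{K} - K \right\| _F^2 = \mathrm {tr}\big ( (\hat{K}-K)(\hat{K}-K)^\top \big ) = \sum _{i=1}^N \ddot{K}_{ii}. \end{aligned}$$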

We approximate \(\hat{k}(\mathbf {x}_i,\mathbf {y}_j)\) and \(k(\mathbf {x}_i,\mathbf {y}_j)\) using the Nyström approximation and obtain the matrices \(\hat{K}_{(N,m)}, \hat{K}^{-1}_{(m,m)}\) and \({K}_{(N,m)}, {K}^{-1}_{(m,m)}\) as defined before. With some basic algebraic operations one obtains the following expression for the Frobenius norm of the difference of two Nyström approximated matrices (in matrix notation). Let \(C = \hat{K}_{(N,m)} \otimes K_{(N,m)}\) and \(W = C^{-1}_{(m,m)}\). Further we introduce the matrix \(\hat{C}\) with entries \(\hat{C}_{[i,j]} = C^2_{[i,j]}\) and \(\hat{W} = \hat{C}^{-1}_{(m,m)}\), as well as \(C^\prime \) with entries \(C^\prime _{[i,j]}=K^2_{[i,j]}\) and \(W^\prime = K^{-1}_{(m,m)}\). Then the approximated Frobenius norm can be derived as:

$$\begin{aligned} \left\| {\hat{k} - k} \right\| _{F} = \sqrt{\sum (\sum (\hat{C} \cdot \hat{W})) \cdot \hat{C}^\top + (\sum ({C}^\prime \cdot {W}^\prime )) \cdot {C}^{\prime ,\top } - 2 \cdot (\sum (C \cdot W)) \cdot C^\top } \end{aligned}$$
(7)

All involved matrices have at most \(N \times m\) entries, hence this operation can be done at costs linear in N.
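The following sketch computes the same quantity directly from the low-rank Nyström factors via the trace identity \(\Vert A - B\Vert _F^2 = \mathrm {tr}(AA) + \mathrm {tr}(BB) - 2\,\mathrm {tr}(AB)\) for symmetric A, B; it does not follow the exact notation of Eq. (7), but it has the same \(\mathcal {O}(N m^2)\) cost and avoids forming any \(N \times N\) matrix (function and variable names are our own):

import numpy as np

def frobenius_diff_nystroem(A_Nm, Wa, B_Nm, Wb):
    # ||A - B||_F for A = A_Nm Wa A_Nm^T and B = B_Nm Wb B_Nm^T (symmetric psd),
    # computed from the factors only, i.e. with O(N m^2) cost.
    Gaa = A_Nm.T @ A_Nm                      # m x m Gram matrices of the factors
    Gbb = B_Nm.T @ B_Nm
    Gab = A_Nm.T @ B_Nm
    tr_AA = np.trace(Wa @ Gaa @ Wa @ Gaa)    # tr(A A)
    tr_BB = np.trace(Wb @ Gbb @ Wb @ Gbb)    # tr(B B)
    tr_AB = np.trace(Wa @ Gab @ Wb @ Gab.T)  # tr(A B)
    return np.sqrt(max(tr_AA + tr_BB - 2.0 * tr_AB, 0.0))

Applied to the Nyström factors of \(\hat{k}\) and k it yields the quantity needed in each step of Algorithm 1 without materializing any \(N \times N\) matrix.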

Complexity and Error Analysis. The Nyström approximation of k can be calculated once, prior to the random Fourier feature selection, with costs of \(\mathcal {O}(N \times m + m^3)\) for obtaining the two submatrices of the Nyström approximation. If we assume that \(m \ll N\) this is summarized by \(\mathcal {O}(N \times m)\). The Nyström approximation of \(\hat{k}\) needs to be recalculated in each enhancement step of the feature construction. If we add one feature per iteration, the costs of \(m^3\) can be avoided by use of the matrix inversion lemma. If we restrict the number of added features by \(D_{max}\), extra costs of \(\mathcal {O}(D_{max} \times N \times m)\) arise for calculating the Nyström approximations of \(\hat{k}\). If we assume that \(D_{max} \ll N\) this is again reduced to \(\mathcal {O}(N \times m)\).
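One way to realize this update, stated here only for reference since the paper does not spell out its exact use, is the Sherman-Morrison-Woodbury identity

$$\begin{aligned} (A + UCV)^{-1} = A^{-1} - A^{-1}U\left( C^{-1} + V A^{-1} U \right) ^{-1} V A^{-1}, \end{aligned}$$

applied with A the (suitably rescaled) previous landmark block \(\hat{K}_{(m,m)}\), whose inverse is already available, and UCV the rank-2 term contributed by the appended cosine/sine feature pair; the updated inverse then costs only \(\mathcal {O}(m^2)\) instead of \(\mathcal {O}(m^3)\).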

The calculation of the Frobenius norm can be done in linear time using Eq. (7). Hence, we finally have costs of \(\mathcal {O}(N \times m)\) for generating the random Fourier features. The number of effectively chosen random Fourier features is in general much smaller than \(D_{max}\). In the following we use \(m = 50\) and \(D_{max} = 5000\) and report crossvalidation accuracies using the approximated \(\hat{k}\) in comparison to k for different classification tasks.

As mentioned before and shown by the runtime analysis, the approach is reasonable only if the number of landmarks m is low with respect to N, or, respectively, if the intrinsic dimensionality of the dataset is low. Taking e.g. an RBF kernel, the \(\sigma \) parameter controls the width of the Gaussian. If the RBF kernel is employed in a kernel classifier we observe that for very small \(\sigma \) a nearest neighbor approach is approximated and the intrinsic dimensionality of the data, or the number of non-vanishing eigenvalues, gets large. In these cases the RBF representation cannot be approximated without a correspondingly high loss in the prediction accuracy of the model and our approach cannot be used. In the proposed procedure we incur two approximation errors, namely the error introduced by the random Fourier feature approximation and the error introduced by the Nyström approximation. We have:

$$\begin{aligned} \left\| {\hat{K} - K} \right\| _{F} = \sqrt{\sum _{i=1}^N \sum _{j=1}^N | \hat{K}(\mathbf {x}_i, \mathbf {x}_j) - K(\mathbf {x}_i,\mathbf {x}_j)|^2} \end{aligned}$$
(8)

where \(\hat{K}\) is the Nyström approximation of the kernel matrix obtained from the random Fourier features of the training data. By the triangle inequality we get

$$\begin{aligned} \left\| {\hat{K} - K} \right\| _{F} \le \left\| {\varphi ^{{ \top }} \varphi - K} \right\| _{F} + \left\| {\varphi ^{{ \top }} \varphi - \hat{K}} \right\| _{F} \end{aligned}$$
(9)

where \(\hat{K}\), as defined above, is the Nyström approximation of the linear kernel matrix on the random Fourier features and \(\varphi ^\top \varphi \) is given as

$$\begin{aligned} \varphi (\mathbf {x}_i)^\top \varphi (\mathbf {x}_j) = \frac{\alpha }{D} \sum _{l=1}^D \cos (\omega _l^\top (\mathbf {x}_i - \mathbf {x}_j)) \end{aligned}$$
(10)

where D is the number of random Fourier features, \(\omega _l \sim N(0,I_d)\), \(\alpha >0\), and \(\mathbf {x}_i,\mathbf {x}_j\) are training points whose RFF feature values are stored in \(\varphi \). In the following we derive and combine the bounds for both approximation schemes. The Frobenius error of the approximated kernel using the random Fourier features is given as

$$\begin{aligned} \left\| {\varphi ^{{ \top }} \varphi - K} \right\| _{F} = \sqrt{\sum _{i=1}^N \sum _{j=1}^N | \varphi (\mathbf {x}_i)^\top \varphi (\mathbf {x}_j) - K(\mathbf {x}_i,\mathbf {x}_j)|^2} \end{aligned}$$
(11)

with K as the kernel matrix of the training points \(\{\mathbf { x}_1, \ldots ,\mathbf { x}_N\}\). For a fixed pair of points \((\mathbf {x}_i,\mathbf {x}_j)\) we have:

$$\begin{aligned} Pr\left\{ \left| \varphi (\mathbf {x}_i)^\top \varphi (\mathbf {x}_j) - \underbrace{E[\varphi (\mathbf {x}_i)^\top \varphi (\mathbf {x}_j)]}_{K(\mathbf {x}_i,\mathbf {x}_j)}\right| > t\right\}&= Pr \left\{ \left| \frac{\alpha }{D} \sum _{l=1}^D \cos (\omega _l^\top (\mathbf {x}_i-\mathbf {x}_j)) - K(\mathbf {x}_i,\mathbf {x}_j) \right| > t \right\} \end{aligned}$$
(12)
$$\begin{aligned}&\le 2 \exp \left\{ \frac{-t^2 D}{2 \alpha ^2} \right\} \end{aligned}$$
(13)

The last inequality follows from Hoeffding's inequality: the terms \(\cos (\omega _l^\top (\mathbf {x}_i - \mathbf {x}_j ))\) are independent w.r.t. \(\{ \omega _1, \ldots , \omega _D \}\) and bounded in \([-1, 1]\), so each summand \(\frac{\alpha }{D}\cos (\omega _l^\top (\mathbf {x}_i - \mathbf {x}_j))\) lies in an interval of length \(2\alpha /D\) and Hoeffding's inequality yields the factor \(2 \exp (-t^2 D/(2\alpha ^2))\). The above statement holds for a fixed pair and can be extended to all pairs from \(\{\mathbf {x}_1 ,\ldots ,\mathbf { x}_N \}\). Hence the following holds simultaneously for all pairs:

$$\begin{aligned} Pr \{ \exists (i,j) : | \varphi (\mathbf {x}_i)^\top \varphi (\mathbf {x}_j) - K(\mathbf {x}_i,\mathbf {x}_j) | > t \} \le 2 N^2 \exp \left\{ \frac{-t^2 D}{2\alpha ^2} \right\} \end{aligned}$$
(14)

by the union bound. Hence we obtain

$$\begin{aligned} \left\| {\varphi ^{{ \top }} \varphi - K} \right\| _{F} \le \sqrt{N^2 t^2} = N \cdot t \end{aligned}$$
(15)

with probability at least \(1-2 N^2 \exp \left\{ \frac{-t^2 D}{2\alpha ^2} \right\} \). To reduce the failure probability to an arbitrarily small \(\delta \) we set:

$$\begin{aligned} 2 N^2 \exp \left\{ \frac{-t^2 D}{2\alpha ^2} \right\}&= \delta \\ \frac{t^2 D}{2\alpha ^2}&= \log \frac{2 N^2}{\delta } = \log \frac{2}{\delta } + 2 \log N \\ t&= \alpha \sqrt{\frac{2}{D}\left( \log \frac{2}{\delta } + 2 \log N \right) } \end{aligned}$$

We get

$$\begin{aligned} \left\| {\varphi ^{{ \top }} \varphi - K} \right\| _{F} \le N \cdot \alpha \sqrt{ \frac{2}{D} \left( \log \frac{2}{\delta } + 2 \log N \right) } = \tilde{\mathcal O}\left( \frac{N}{\sqrt{D}} \right) \end{aligned}$$
(16)

We can ensure that \( \left\| {\varphi ^{{ \top }} \varphi - K} \right\| _{F} \le \epsilon \) by choosing D large enough:

$$\begin{aligned} N \cdot \alpha \sqrt{\frac{2}{D} \left( \log \frac{2}{\delta } + 2 \log N \right) }&\le \epsilon \\ \frac{2}{D} \left( \log \frac{2}{\delta } + 2 \log N \right)&\le \frac{\epsilon ^2}{N^2 \alpha ^2} \\ D&\ge \frac{2 N^2 \alpha ^2 \left( \log \frac{2}{\delta }+2\log N \right) }{\epsilon ^2} =\tilde{\mathcal O} \left( \frac{N^2}{\epsilon ^2} \right) \end{aligned}$$
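To make these bounds concrete, the following small computation evaluates t from the derivation above and the resulting requirement on D for illustrative values of N, \(\alpha \), \(\delta \) and \(\epsilon \) (the numbers are arbitrary and not taken from the experiments):

import numpy as np

# Illustrative values only; they are not taken from the paper's experiments.
N, alpha, delta = 10_000, 1.0, 0.05
for D in (500, 5000):
    t = alpha * np.sqrt(2.0 / D * (np.log(2.0 / delta) + 2.0 * np.log(N)))
    print(D, N * t)    # value of the bound N*t on ||phi^T phi - K||_F, cf. Eq. (16)

eps = 1.0              # target Frobenius error
D_required = 2 * N ** 2 * alpha ** 2 * (np.log(2 / delta) + 2 * np.log(N)) / eps ** 2
print(D_required)      # D needed so that the bound drops below eps

As expected from the \(\tilde{\mathcal O}(N/\sqrt{D})\) rate, the bound shrinks only with \(\sqrt{D}\), so guaranteeing a small absolute Frobenius error for large N would require a very large D; in practice the constructive procedure of Sect. 3 stops much earlier.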

For the second approximation error we bound the deviation of the inner products of the random Fourier feature vectors obtained from the training data from their Nyström approximation. This is just a classical Nyström approximation of a kernel matrix, hence we can use bounds already provided in [24]. According to Theorem 2 in [24] the following inequality holds with probability at least \(1-\delta \):

$$\begin{aligned} \left\| {\varphi ^{{ \top }} \varphi - \hat{K}} \right\| _{F} \le \left\| {\varphi ^{{ \top }} \varphi - K_{k} } \right\| _{F} + \left[ \frac{64D}{m} \right] ^{\frac{1}{4}} N K_{max} \left[ 1+ \sqrt{\frac{N-m}{N-1/2} \frac{1}{\beta (m,N)} \log \frac{1}{\delta } \, d^K_{max} / K_{max}^{\frac{1}{2}}} \right] ^{\frac{1}{2}} \end{aligned}$$
(17)

with \(\beta (m,N) = 1 - \frac{1}{2 \max \{m,\, N-m\}}\), \(K_k\) the best rank-k approximation of K, \(K_{max}\) the maximal diagonal entry of K, and \(d^K_{max}\) the maximum Euclidean distance defined over K. In accordance with [22] this may be summarized as \( \left\| {\varphi ^{{ \top }} \varphi - K_{k} } \right\| _{F} + [\frac{D}{m}]^{{\frac{1}{4}}} N\left\| K \right\| _{2} \). Combining both bounds we get

$$\begin{aligned} \left\| {\hat{K} - K} \right\| _{F}&\le \left\| {\varphi ^{{ \top }} \varphi - K} \right\| _{F} + \left\| {\varphi ^{{ \top }} \varphi - \hat{K}} \right\| _{F} \\&\le \left\| {\varphi ^{{ \top }} \varphi - K_{k} } \right\| _{F} + N\cdot \left( {\alpha \sqrt{\frac{2}{D}\left( {\log \frac{2}{\delta } + 2\log N} \right) } + \left[ \frac{D}{m}\right] ^{{\frac{1}{4}}} \left\| K \right\| _{2} } \right) \end{aligned}$$
(18)

We see that both approximation terms increase as \(\tilde{\mathcal O}(N)\), that is, up to log factors the kernel approximation error increases linearly with the number of training points N. This was expected, since the Gram matrix K has size \(N \times N\). We also notice a tradeoff for the value of D: the random Fourier feature approximation bound tightens as D increases, whereas the Nyström approximation bound loosens. (One could possibly use the value of D that minimizes the approximation error bound, although we have not tried this in the experiments.) The approximation error bound presented here is uniform only over the training points, which was much simpler to achieve than a bound that holds uniformly over the whole input domain as in the original paper [10]. Nevertheless we can still expect it to be informative, since for kernel based learning there exist generalization bounds whose complexity term depends only on the Gram matrix constructed from the training set (e.g. the Rademacher complexity for kernel based linear classification works out as the trace of the Gram matrix).

5 Experiments

We evaluate the approach on multiple public datasets, most of them already used in [10]. For the Nyström approximation step we use 50 landmarks. The checkerboard data (checker) is a two-class problem consisting of 9000 2d samples organized like a checkerboard on a \(3 \times 3\) grid. The data are separable with low error using an RBF kernel. The coil-20 dataset (coil) consists of 1440 image files in 16384 dimensions from the COIL database, categorized in 20 classes. The spam database consists of 4601 samples in 57 dimensions in two classes. The adult dataset consists of 30162 samples in 44 dimensions given in 2 classes. The cod-rna dataset with 59535 samples and 8 dimensions is taken from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Two further datasets are the well known USPS data with 11000 samples in 256 dimensions and the MNIST data with 60000 samples and 256 dimensions; both are organized in 10 classes originating from a character recognition problem. Finally the covertype data with 495141 entries (classes 1 and 2) and 54 dimensions and the Forest data with 522,000 samples and 54 dimensions, both taken from the UCI database, were analyzed. All datasets have been z-transformed. For the \(\sigma \) parameter of the RBF kernel we use values reported elsewhere before. To evaluate the classification performance we follow [10] and use a least squares regression (LS) model as well as liblinear, a high performance linear Support Vector Machine. The parameter C of the liblinear SVM was fixed to 1, as suggested by the liblinear authors. Multiclass problems were approached in LS using a one-vs-rest scheme. In Table 1 we report 10-fold crossvalidation results and the minimal number of features \(D^*\) obtained by the proposed strategy. The maximal number of random Fourier features D per dataset is in general 5000, as suggested in [10]; for the larger datasets only 500–1000 features were allowed, to keep memory consumption tractable.
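The evaluation pipeline can be sketched as follows (an illustration only, with toy stand-in data and a stand-in frequency matrix; scikit-learn's LinearSVC takes the role of liblinear with C = 1):

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def rff_transform(X, Omega):
    # Map data into the selected random Fourier feature space (matrix P = Omega).
    proj = X @ Omega.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(Omega.shape[0])

# X, y: z-transformed data and labels; Omega: frequencies selected by Algorithm 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8)); y = (X[:, 0] > 0).astype(int)   # toy stand-in data
Omega = rng.normal(size=(100, 8))                              # stand-in for the selected set
Z = rff_transform(X, Omega)
scores = cross_val_score(LinearSVC(C=1.0), Z, y, cv=10)        # 10-fold crossvalidation
print(scores.mean(), scores.std())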

Table 1. Test set accuracy (% ± std) on the various benchmark datasets when constructing small sets of random Fourier features. The second row of each dataset contains the mean CPU time of a single cycle in the crossvalidation.
Fig. 1. Top left: reconstruction of the USPS radial basis function kernel with 5000 random Fourier features; top right: reconstruction of the USPS radial basis function kernel with the identified random Fourier features. Bottom left: reconstruction of the coil radial basis function kernel with 5000 random Fourier features; bottom right: reconstruction with the random Fourier features obtained by the proposed approach.

For the coil data we see that the identified small RFF model contains 29 times fewer features than the full model while losing \(\approx \)2 % discrimination accuracy on the test set. For the spam database we observe a similar result, with 27 times fewer features and a small decay in accuracy of \(3\,\%\) for the SVM. On the simulated checkerboard data we have almost the same accuracy in the reduced set while the number of features is reduced by a factor of 250. For the USPS data we have \(\approx \)6 times fewer features with almost the same prediction accuracy, reduced only slightly by 1–2 %. The adult dataset keeps almost the same accuracy while having 5 times fewer features; similar observations can be made for the cod-rna data. For MNIST the accuracy drops by 7–8 % with 21 times fewer features. Finally, the cover data are represented by 28 times fewer features with similarly good accuracy as the full model, and the forest data could be represented with 22 times fewer features with a decay of 7 % in the accuracy. For USPS and MNIST we found that the number of remaining features is still somewhat high, which can potentially be attributed to a more complex eigenvalue structure of these datasets, such that the proposed test was less efficient. The other datasets show essentially the same accuracy on a drastically reduced feature set. For the coil and the USPS data the kernel reconstructions are shown exemplarily in Fig. 1.

6 Conclusions

In this paper we proposed a test for selecting a small set of random Fourier features such that the approximated shift-invariant kernel is close to the original one with respect to the Frobenius norm. In general we found that the proposed approach efficiently reduces the number of features, already during the construction, typically by an order of magnitude or more, at low cost with respect to N. The approach is especially applicable if the approximated kernel is of low rank and N is large. The proposed selection procedure thereby efficiently obtains small random Fourier feature sets with high representation accuracy. The effect of sometimes reduced accuracy for random Fourier features as observed in [14] could not be confirmed, as long as the RFF set is either large enough or appropriately chosen by the proposed method. The proposed approach saves runtime and memory costs during training but is also very valuable if memory is constrained under test conditions, e.g. within an embedded system environment. The obtained transformation matrix P, holding the sampled frequencies, has \(d \times D\) coefficients, which is most often small enough to be of use also under system conditions with limited resources. The original data need only be transformed into the random Fourier feature space using P by a simple matrix multiplication and can subsequently be fed into a linear classifier. The obtained models are in general very efficient, as seen above. The small \(D^*\) also avoids the need to sparsify the linear models by using ridge regression (instead of simple LS) or sparse linear SVM models like the support feature machine [18], such that efficient high performance implementations of linear classifiers can be used directly. In future work we will analyze the effect of our approach on tensor sketching [25], which was used to approximate polynomial kernels.