1 Introduction

Feature selection is an important topic in machine learning and data mining, as it is key to constructing sparse, accurate and interpretable models [7, 9, 13]. Given a batch of high-dimensional data instances, the overall goal is to find a small subset of relevant features, which is then used to construct a low-dimensional predictive model. In modern applications involving streaming data, feature selection is not a “single-shot” offline operation, but an online process that iteratively updates the pool of relevant features, so as to track a sparse predictive model [16, 20]. A prototypical example of online feature selection is the anti-spam filtering task, in which the learner is required to classify each incoming message using a small subset of features that is likely to evolve over time.

Conceptually, the online feature selection problem can be cast as a repeated prediction game between the learner and its environment. During each round t of the game, the learner starts by selecting a subset of at most B features over \(\{1,\cdots ,d\}\), where B is a predefined budget. A predictive model \(\varvec{w}_t\) is then built upon the selected features; in the present paper, \(\varvec{w}_t\) is assumed to be a linear function over \(\mathbb R^d\). Then, a labelled example \((\varvec{x}_t,y_t) \in \mathbb R^d \times \mathbb R\) is supplied by the environment, and the learner incurs a loss \(f(\varvec{w}_t; \varvec{x}_t, y_t)\). The overall goal for the learner is to minimize its cumulative loss over the T rounds of the game.

From a computational viewpoint, online feature selection is far from easy since, at each round t, the learner is required to solve a constrained optimization task, characterized by a budget (or \(\ell _0\) pseudo-norm) constraint on the model \(\varvec{w}_t\). Indeed, this problem is known to be NP-hard for common loss functions advocated in classification and regression settings [10]. In order to alleviate this difficulty, two main approaches have been proposed in the literature. The first approach is to replace the nonconvex \(\ell _0\) constraint by a convex \(\ell _1\) constraint, or an \(\ell _1\) regularizer [3, 4, 6, 8, 11, 15]. Though this approach promotes sparse solutions, it cannot guarantee that, at each iteration, the number of selected features is bounded by the predefined budget B. The second approach proceeds in two main steps: first, solve a convex, unconstrained optimization problem, and next, seek a new solution that approximates the unconstrained solution while satisfying the \(\ell _0\) constraint. Based on this second approach, the OFS [16] and SOFS [20] strategies exploit truncation techniques for maintaining a budgeted number of features. However, OFS is oblivious to the history of predictions made so far, which might prove useful for assessing the frequencies of features. SOFS uses a suboptimal truncation rule that only considers the confidence of feature values in the current model, and ignores the magnitude of feature values which, again, could prove useful for estimating their relevance. Moreover, Wu et al. [20] did not provide any theoretical analysis for SOFS.

In this paper, we investigate the online feature selection problem using novel truncation techniques. Our contributions are threefold:

  1. Two online feature selection algorithms, called Budgeted ARDA (B-ARDA) and Budgeted AMD (B-AMD), are proposed. B-ARDA and B-AMD perform truncation to eliminate irrelevant features. In this paper, the relevance of features is assessed by their frequency in the sequence of predictions, and by their magnitude in the current predictor.

  2. A detailed regret analysis for both algorithms is provided, which captures the intuition and rationale behind our truncation techniques.

  3. Experiments on six high-dimensional datasets reveal the superiority of the proposed algorithms over both advanced feature selection algorithms and \(\ell _1\)-based online learning algorithms.

The paper is organized as follows. Section 2 reviews related work in feature selection and online learning. Section 3 presents the notation used throughout the paper and elaborates on the problem setting. Our learning algorithms and their regret analysis are detailed in Sect. 4. Comparative experiments are reported in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Related Work

Feature selection is a well-studied topic in machine learning and data mining [1, 7, 23]. Existing feature selection approaches include batch (or offline) methods and online methods. Batch methods, examined for instance in [12,13,14, 18], typically require access to all available data, which makes them ill-suited to sequential data. On the other hand, online methods are better equipped to handle large-scale, and potentially streaming, information. Currently, there are two different “online modes” for selecting features. The first mode assumes that the number of examples is fixed but features arrive sequentially over time, as in [17, 19, 22]. By contrast, the second mode assumes that the number of features is known in advance, but examples are supplied one by one, as studied for example in [16, 20]. We focus here on the second online mode, which is more natural for real-world streaming data. According to this mode, online feature selection methods can be grouped into three categories, summarized in Table 1.

Table 1. A list of recent works in online feature selection

\(\ell _1\) Constraint/Regularization. Methods enforcing an \(\ell _1\) constraint project the solution \(\varvec{w}\) obtained after a gradient descent update onto an \(\ell _1\) ball of radius r. Recent works, such as [2, 5], focus on designing efficient projection algorithms. Many studies also aim at solving an \(\ell _1\)-regularized convex optimization problem. Notably, in [3], Duchi et al. propose the Fobos algorithm, which first performs a sub-gradient descent step in order to get an intermediate solution, and then seeks a new solution that stays close to the intermediate solution and has a low \(\ell _1\) norm complexity. The second stage can be solved efficiently by truncating coefficients below a threshold in the intermediate solution. In [8], Langford et al. claim that such a truncation operation is too aggressive and propose an alternative truncated gradient technique (TrunGrad), which gradually shrinks the coefficients to zero by a small amount. In [6], Duchi et al. generalize Online Mirror Descent (OMD) to regularized losses, and propose the Composite Mirror Descent (CMD) algorithm, which exploits the composite structure of the objective to handle the regularizer explicitly in each update. Their derived algorithms include Fobos as a special case. In [21], Xiao presents an \(\ell _1\)-Regularized Dual Averaging algorithm (\(\ell _1\)-RDA) which, at each iteration, minimizes the sum of three terms: a linear function obtained by averaging all previous sub-gradients, an \(\ell _1\) regularization term and an additional strongly convex regularization term. In [4], Duchi et al. propose ARDA and ACMD, which adaptively modify the proximal function in order to incorporate information about the geometry of the data observed in earlier iterations. The derived algorithms, \(\ell _1\)-ARDA and Ada-Fobos, achieve better performance than their non-adaptive counterparts, namely \(\ell _1\)-RDA and Fobos. In [15], Wang et al. present a framework for sparse online classification. Their methods perform feature selection by carefully tuning the \(\ell _1\) regularization parameter.

\(\ell _0\) Truncation. In contrast with the above approaches, Jin et al. [16] propose a truncation method that satisfies the budget (or \(\ell _0\)) constraint at each iteration. Their OFS algorithm first projects the predictor \({\varvec{w}}\) (obtained from gradient descent) onto an \(\ell _2\) ball, so that most of the numerical mass of \({\varvec{w}}\) is concentrated in its largest elements, and then keeps only the B largest weights in \({\varvec{w}}\). Wu et al. [20] further explore this truncation method for the confidence-weighted learning algorithm AROW, and propose SOFS, which simply truncates the elements with least confidence after the update step in the diagonal version of AROW.

Our proposed online feature selection algorithms are also based on truncation techniques. Yet, our approaches differ from OFS and SOFS in that the truncation strategies are tailored to adaptive sub-gradient methods, namely ARDA and AMD, which perform more informative gradient updates and can identify highly discriminative but rarely seen features. Moreover, we provide a detailed regret analysis for the truncated versions of ARDA and AMD.

3 Notation and Problem Setting

In what follows, lowercase letters denote scalars or vectors, and uppercase letters represent matrices. An exception is the parameter B, which captures our budget on the number of selected features. Let [d] denote the set \(\{1,\cdots ,d\}\). We use \(\varvec{I}\) to denote the identity matrix, and \(\mathrm {diag}(\varvec{v})\) to denote the diagonal matrix with vector \({\varvec{v}}\) on the diagonal. For a linear predictor \({\varvec{w}}_t\) chosen at iteration t, we use \(w_{t,i}\) to denote its ith entry. As usual, we use \(\langle {\varvec{v}}, {\varvec{w}} \rangle \) to denote the inner product between \({\varvec{v}}\) and \({\varvec{w}}\), and for any \(p \in [1,\infty ]\), we use \(||{\varvec{w}}||_p\) to denote the \(\ell _p\) norm of \({\varvec{w}}\). We also use \(||\varvec{w}||_0\) to denote the \(\ell _0\) pseudo-norm of \(\varvec{w}\), that is, \(||\varvec{w}||_0 = |\{i \in [d]: w_i \ne 0\}|\). For a convex loss function \(f_t\), the sub-differential set of \(f_t\) at \({\varvec{w}}\) is denoted by \(\partial f_t({\varvec{w}})\), and \({\varvec{g}}_t\) is used to denote a sub-gradient of \(f_t\) at \({\varvec{w}}_t\), i.e. \({\varvec{g}}_t \in \partial f_t({\varvec{w}}_t)\). When \(f_t\) is differentiable at \(\varvec{w}\), we use \(\nabla f_t({\varvec{w}})\) to denote its unique sub-gradient (i.e., the gradient). Let \({\varvec{g}}_{1:t} = [{\varvec{g}}_1 \ {\varvec{g}}_2 \ \cdots \ {\varvec{g}}_t]\) be the \(d \times t\) matrix obtained by concatenating the sub-gradients \({\varvec{g}}_j\) from \(j=1\) to t. The ith row vector of \({\varvec{g}}_{1:t}\) is denoted by \({\varvec{g}}_{1:t,i}\). Let \(\psi _t\) be a strictly convex and continuously differentiable function defined, at each iteration t, on a closed convex set \(\mathcal {C} \subseteq \mathbb {R}^d\), and let \(\mathcal {D}_{\psi _t}({\varvec{x}}, {\varvec{y}}) \) denote the corresponding Bregman divergence, given by:

$$\begin{aligned} \mathcal {D}_{\psi _t}({\varvec{x}}, {\varvec{y}}) = \psi _t({\varvec{x}}) - \psi _t({\varvec{y}}) - \langle \nabla \psi _t({\varvec{y}}), {\varvec{x}}-{\varvec{y}} \rangle , \quad \forall {\varvec{x}}, {\varvec{y}} \in \mathcal {C}. \end{aligned}$$

By construction, we have \(\mathcal {D}_{\psi _t}({\varvec{x}}, {\varvec{y}}) \ge 0 \) and \(\mathcal {D}_{\psi _t}({\varvec{x}}, {\varvec{x}}) = 0 \) for all \({\varvec{x}}, {\varvec{y}} \in \mathcal C\).
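
As a concrete instance (a worked special case, not taken from the paper), when \(\psi _t\) is the quadratic function induced by a positive definite diagonal matrix \(\varvec{H}_t\), the Bregman divergence reduces to a weighted squared Euclidean distance:

$$\begin{aligned} \psi _t({\varvec{x}}) = \tfrac{1}{2} \langle {\varvec{x}}, \varvec{H}_t {\varvec{x}} \rangle \quad \Longrightarrow \quad \mathcal {D}_{\psi _t}({\varvec{x}}, {\varvec{y}}) = \tfrac{1}{2} \langle {\varvec{x}} - {\varvec{y}}, \varvec{H}_t ({\varvec{x}} - {\varvec{y}}) \rangle . \end{aligned}$$

Up to the factor \(\tfrac{1}{2}\), this weighted distance is the quantity that appears in the truncation error \(\xi _t^2\) of Theorem 1 below.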

As mentioned above, the online feature selection problem can be formulated as a repeated prediction game between the learner and its environment. At iteration t, a new data point \({\varvec{x}}_t \in \mathbb {R}^{d}\) is supplied to the learner, which is required to predict a label for \(\varvec{x}_t\) according to its current model \(\varvec{w}_t\). We assume that \(\varvec{w}_t\) is a sparse linear function in \(\mathbb {R}^d\) such that \(||\varvec{w}_t||_0 \le B\), where B is a predefined budget. Once the learner has committed to its prediction, the true label \(y_t \in \mathbb {R}\) of \({\varvec{x}}_t\) is revealed, and the learner suffers a loss \(l({\varvec{w}}_t; ({\varvec{x}}_t, y_t))\). We use here \( l_t({\varvec{w}}_t) = l({\varvec{w}}_t; ({\varvec{x}}_t, y_t))\), and we assume that \(l_t({\varvec{w}}_t) = f_t({\varvec{w}}_t) + \varphi ({\varvec{w}}_t)\), where \(f_t({\varvec{w}}_t)\) is a convex loss function and \(\varphi ({\varvec{w}}_t)\) is a regularization function. The performance of the learner is measured according to its regret:

$$\begin{aligned} \mathcal {R}^{T} = \sum _{t=1}^T l_t({\varvec{w}}_t) - \min _{{\varvec{w}} \in \mathbb {R}^d: || \varvec{w} ||_0 \le B} \sum _{t=1}^T l_t({\varvec{w}}), \end{aligned}$$

where \(||{\varvec{w}}_t||_0 \le B\) for all t. Our goal is to devise online feature selection strategies whose regret is sublinear in T. The nonconvex \(\ell _0\) constraint makes our problem more challenging than standard online convex optimization tasks.

4 B-ARDA and B-AMD

The adaptive ARDA and AMD algorithms take full advantage of the sub-gradient information observed in earlier iterations to perform more informative updates. Since ARDA and AMD are different methods, we develop a specific truncation strategy for each of them.

4.1 B-ARDA and Its Regret Analysis

A straightforward approach for performing \(\ell _0\) truncations is to keep the B elements with largest magnitude (in absolute value) in the current predictor \({\varvec{w}}_t\). Such a naive approach suffers from an important shortcoming: frequently occurring discriminative features tend to be removed. This flaw results from the updating rule of adaptive sub-gradient methods: frequent attributes are given low learning rates, while infrequent attributes are given high learning rates, so the weights of frequent features tend to remain small in magnitude even when those features are discriminative.

Thus, we need to consider a more sophisticated truncation approach which takes into account the frequencies of features, together with their magnitude. To this end, we present the pseudocode of B-ARDA in Algorithm 1. Basically, B-ARDA starts with a standard ARDA iteration from Step 1 to Step 9, which provides an intermediate solution \({\varvec{z}}_{t+1}\) for which \(||{\varvec{z}}_{t+1} ||_0 \le B\) may not hold; then, at Step 10, the algorithm truncates \({\varvec{z}}_{t+1}\) in order to find a new solution \({\varvec{w}}_{t+1}\) such that \(||{\varvec{w}}_{t+1}||_0 \le B\) is satisfied. In our truncation operation, we consider both the magnitude of the elements in \({\varvec{z}}_{t+1}\) and the frequency of features, as conveyed by the diagonal matrix \(\varvec{H}_t\).

Algorithm 1. B-ARDA

Note that the update at Step 9 often admits a closed form. For example, if we use the standard Euclidean regularizer \(\varphi (\varvec{w}) = \frac{\lambda }{2} ||\varvec{w}||_2^2\), we get that

$$ \varvec{z}_{t+1} = - \eta (\lambda \eta t \varvec{I} + \varvec{H}_t)^{-1} \sum _{i=1}^t \varvec{g}_i. $$

The truncation operation at Step 10 can be efficiently solved by a simple greedy procedure. Let \(\varvec{v}_{t+1} \in \mathbb {R}^d\) be the vector with entries \(v_{t+1,j} = H_{t,jj} z_{t+1,j}^2\). Based on this notation, if \(||\varvec{z}_{t+1}||_0 \le B\), \(\varvec{w}_{t+1} = \varvec{z}_{t+1}\); otherwise, \(\varvec{w}_{t+1} = \varvec{z}_{t+1}^B\), where

$$ z_{t+1,i}^B = {\left\{ \begin{array}{ll} z_{t+1,i} &{} \text{if } H_{t,ii} z_{t+1,i}^2 \text{ is among the } B \text{ largest values of } \varvec{v}_{t+1},\\ 0 &{} \text{otherwise.} \end{array}\right. } $$
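
To make Steps 9 and 10 concrete, here is a minimal Python sketch of one B-ARDA round, assuming (as in ARDA [4]) that \(\varvec{H}_t\) is the diagonal matrix \(\delta \varvec{I} + \mathrm {diag}(||\varvec{g}_{1:t,i}||_2)\) and that \(\varphi (\varvec{w}) = \frac{\lambda }{2} ||\varvec{w}||_2^2\); the function and argument names are ours, not from Algorithm 1.

```python
import numpy as np

def b_arda_step(grad_sum, grad_sq_sum, g_t, t, eta, lam, delta, B):
    """One B-ARDA round: the closed-form ARDA update (Step 9) followed by
    the frequency-aware truncation (Step 10).

    Assumes H_t = delta*I + diag(||g_{1:t,i}||_2), as in ARDA [4], and the
    Euclidean regularizer phi(w) = (lambda/2)*||w||_2^2."""
    grad_sum = grad_sum + g_t                 # running sum of sub-gradients
    grad_sq_sum = grad_sq_sum + g_t ** 2      # running ||g_{1:t,i}||_2^2
    h_t = delta + np.sqrt(grad_sq_sum)        # diagonal of H_t

    # Step 9: z_{t+1} = -eta * (lambda*eta*t*I + H_t)^{-1} * sum_{i<=t} g_i
    z = -eta * grad_sum / (lam * eta * t + h_t)

    # Step 10: keep only the B entries with the largest H_{t,ii} * z_i^2
    w = z.copy()
    if np.count_nonzero(z) > B:
        v = h_t * z ** 2
        keep = np.argpartition(v, -B)[-B:]
        mask = np.zeros_like(z, dtype=bool)
        mask[keep] = True
        w[~mask] = 0.0
    return w, grad_sum, grad_sq_sum
```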

The following result demonstrates that our truncation strategy for ARDA can lead to a sublinear regret. The proof, built essentially on the work of [4], is included in Appendix 1 for completeness.

Theorem 1

Let \(\xi _{t}^2 = \langle \varvec{w}_t - \varvec{z}_t, \varvec{H}_{t-1} (\varvec{w}_t - \varvec{z}_t) \rangle \), which measures the truncation error incurred at iteration \(t-1\). Assume that \(\max _t ||\varvec{g}_t||_{\infty } \le \delta \) and \(\max _t \xi _t \le \xi \). For any \(\varvec{w}^* \in \mathbb {R}^d\), B-ARDA achieves the following regret bound:

$$\begin{aligned} \mathcal {R}_{\text {B-ARDA}}^T \le \frac{\delta }{2 \eta } ||\varvec{w}^*||_2^2 + \left( \frac{1}{2 \eta } ||\varvec{w}^*||_{\infty } + \eta \right) \sum _{i=1}^d ||\varvec{g}_{1:T, i}||_2 + \xi \sqrt{2 T \sum _{i=1}^d ||\varvec{g}_{1:T, i}||_2}. \end{aligned}$$

To see why the bound is sublinear, we notice from [4] that

$$ \sum _{i=1}^d ||\varvec{g}_{1:T,i}||_2 = \sqrt{d} \sqrt{ \inf _{\varvec{s}: \varvec{s} \succeq 0, \langle 1, \varvec{s} \rangle \le d} \left\{ \sum _{t=1}^T \langle \varvec{g}_t, \mathrm {diag}(\varvec{s})^{-1} \varvec{g}_t \rangle \right\} } \le \sqrt{d} \sqrt{\sum _{t=1}^T ||\varvec{g}_t ||_2^2}. $$
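
Since \(\sum _{t=1}^T ||\varvec{g}_t ||_2^2\) grows at most linearly in T when the sub-gradients are bounded, the right-hand side is \(O(\sqrt{T})\), which yields the sublinearity. The rightmost inequality is a direct consequence of the Cauchy-Schwarz inequality, as the following standalone numerical check illustrates (random data, not taken from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 50, 200
G = rng.normal(size=(d, T))                  # column t plays the role of g_t

lhs = np.linalg.norm(G, axis=1).sum()        # sum_i ||g_{1:T,i}||_2 (row norms)
rhs = np.sqrt(d) * np.sqrt((G ** 2).sum())   # sqrt(d) * sqrt(sum_t ||g_t||_2^2)
print(lhs <= rhs)                            # True
```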

When the maximum truncation error \(\xi \) equals 0, we directly recover the regret bound of ARDA. If \(\xi \ne 0 \), we get bounds of the form:

  1. if \(\xi \) is \(O(||\varvec{w}^*||_{\infty } \sqrt{\sum _{i=1}^d ||\varvec{g}_{1:T, i}||_2/T})\), then \(\mathcal {R}_{\text {B-ARDA}}^T = O (||\varvec{w}^*||_{\infty } \sum _{i=1}^d ||\varvec{g}_{1:T, i}||_2)\);

  2. if \(\xi \) is \(\varOmega (||\varvec{w}^*||_{\infty } \sqrt{\sum _{i=1}^d ||\varvec{g}_{1:T, i}||_2/T})\), then \(\mathcal {R}_{\text {B-ARDA}}^T = O(\xi \sqrt{2 T \sum _{i=1}^d ||\varvec{g}_{1:T, i}||_2})\).

In other words, the cumulative loss of B-ARDA using only B features converges to that of an optimal solution in hindsight as T approaches infinity. The value of \(\xi \) is determined by the budget parameter B; larger values of B produce a smaller \(\xi \), while smaller values of B yield a larger \(\xi \).

We mention in passing that the naive truncation method, described at the beginning of this section, may be implemented by replacing Step 10 in Algorithm 1 with

$$ \varvec{w}_{t+1} = \arg \min _{\varvec{w} \in \mathbb {R}^d} \langle \varvec{w} - \varvec{z}_{t+1}, \varvec{w} - \varvec{z}_{t+1} \rangle , \text { subject to } ||\varvec{w}||_0 \le B. $$

The regret bound produced by such a truncation is, however, not sublinear, since the truncation-error term can only be bounded as:

$$\sum _{t=1}^T \langle \varvec{g}_t, \varvec{w}_t - \varvec{z}_t \rangle \le \sum _{t=1}^T ||\varvec{g}_t||_2 ||\varvec{w}_t - \varvec{z}_t ||_2 \le \xi \sum _{t=1}^T ||\varvec{g}_t||_2, \quad \text{where } \xi _t = ||\varvec{w}_t - \varvec{z}_t ||_2 \text{ and } \xi = \max _t \xi _t.$$

4.2 B-AMD and Its Regret Analysis

We now focus on a truncation technique for the sub-gradient method AMD. Our approach also considers both the magnitude of the elements and the frequency of the features. The pseudocode of B-AMD is presented in Algorithm 2, where \(\mathcal {D}_{\psi _t} (\varvec{w}, \varvec{w}_t)\) is the Bregman divergence between \(\varvec{w}\) and \(\varvec{w}_t\). Note that we use AMD rather than ACMD since sparsity is produced by the truncation operation rather than by the composite structure of the objective function.

Algorithm 2. B-AMD

In essence, B-AMD performs an AMD iteration and then truncates the returned solution. Importantly, the AMD update at Step 9 admits a closed-form solution: \(\varvec{z}_{t+1} = \varvec{w}_t - \eta \varvec{H}_t^{-1} \varvec{g}_t\). Similarly to B-ARDA, the truncation operation at Step 10 can be solved efficiently: if \(||\varvec{z}_{t+1}||_0 \le B\), then \(\varvec{w}_{t+1} = \varvec{z}_{t+1}\); otherwise, \(\varvec{w}_{t+1} = \varvec{z}_{t+1}^B\), where \(z_{t+1,i}^B = z_{t+1,i}\) if \(H_{t,ii} | z_{t+1,i}|\) occurs among the B largest values of \(\{H_{t,jj} |z_{t+1,j}| : j \in [d]\}\), and \(z_{t+1,i}^B = 0\) otherwise.
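
For illustration, a minimal Python sketch of one B-AMD round is given below, under the same assumption as before that \(\varvec{H}_t = \delta \varvec{I} + \mathrm {diag}(||\varvec{g}_{1:t,i}||_2)\); the names are ours, not from Algorithm 2.

```python
import numpy as np

def b_amd_step(w_t, grad_sq_sum, g_t, eta, delta, B):
    """One B-AMD round: the closed-form AMD update (Step 9) followed by
    the truncation of Step 10.

    Assumes H_t = delta*I + diag(||g_{1:t,i}||_2), as in the adaptive
    sub-gradient setting of [4]."""
    grad_sq_sum = grad_sq_sum + g_t ** 2
    h_t = delta + np.sqrt(grad_sq_sum)     # diagonal of H_t

    # Step 9: z_{t+1} = w_t - eta * H_t^{-1} g_t
    z = w_t - eta * g_t / h_t

    # Step 10: keep only the B entries with the largest H_{t,ii} * |z_i|
    w = z.copy()
    if np.count_nonzero(z) > B:
        score = h_t * np.abs(z)
        keep = np.argpartition(score, -B)[-B:]
        mask = np.zeros_like(z, dtype=bool)
        mask[keep] = True
        w[~mask] = 0.0
    return w, grad_sq_sum
```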

The next theorem provides a regret bound for B-AMD, and conveys the rationale for the designed truncation. The proof is given in Appendix 2.

Theorem 2

Let \(\xi _t = \sum _{i=1}^d H_{t,ii} | {z}_{t+1,i} - {w}_{t+1,i}|\) denote the truncation error at iteration t. For any \(\varvec{w}^* \in \mathbb {R}^d\), B-AMD achieves the following regret bound:

$$\begin{aligned} \mathcal {R}_{\text {B-AMD}}^T \le \frac{1}{\eta } ||\varvec{w}^*||_{\infty } \sum _{t=1}^T \xi _t + \left( \frac{1}{2 \eta } \max _{t \le T} ||\varvec{w}^* - \varvec{w}_t||_{\infty }^2 + \eta \right) \sum _{i=1}^d ||\varvec{g}_{1:T, i}||_2, \end{aligned}$$

where the first term on the right-hand side stems from the truncation.

Informally, the regret bound in Theorem 2 indicates that the cumulative loss of B-AMD converges toward the cumulative loss of the optimal \(\varvec{w}^*\) as T tends toward infinity, and the gap between the two is mainly dominated by the sum of truncation errors, that is, \(\sum _{t=1}^T \xi _t\). This observation implies that we should try to minimize \(\xi _t\) at each round in order to reduce the gap. If the truncation error is set to \(\xi _t = 0\) for any t, the regret bound of AMD is immediately recovered.

5 Experiments

This section reports two experimental studies (Footnote 1). In the first experiment, we compare B-ARDA and B-AMD with OFS and SOFS; in the second one, we compare our algorithms with \(\ell _1\)-ARDA and Ada-Fobos, which achieve feature selection by carefully tuning the \(\ell _1\) regularization parameter. Although the theoretical analysis of our algorithms holds for many convex losses and regularization functions, we use here the squared hinge loss and the \(\ell _2\) regularizer, that is, \( f_t(\varvec{w}_t) = (\max \{0, 1 - y_t \langle \varvec{w}_t, \varvec{x}_t \rangle \})^2\) and \( \varphi (\varvec{w}_t) = \frac{\lambda }{2} ||\varvec{w}_t||_2^2\).
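
For concreteness, this loss, its gradient, and the regularizer can be written as follows (a small self-contained Python illustration; the function names are ours):

```python
import numpy as np

def squared_hinge_loss(w, x, y):
    """f_t(w) = (max(0, 1 - y * <w, x>))^2, the loss used in the experiments."""
    margin = 1.0 - y * np.dot(w, x)
    return max(0.0, margin) ** 2

def squared_hinge_grad(w, x, y):
    """Gradient of the squared hinge loss with respect to w."""
    margin = 1.0 - y * np.dot(w, x)
    if margin <= 0.0:
        return np.zeros_like(w)
    return -2.0 * margin * y * x

def l2_regularizer(w, lam):
    """phi(w) = (lambda/2) * ||w||_2^2."""
    return 0.5 * lam * np.dot(w, w)
```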

5.1 Datasets

Our experiments were performed on six high-dimensional binary classification datasets, selected from different domains. Their statistics are presented in Table 2, where “data density” is the maximal number of non-zero features per instance divided by the total number of features. Arcene’s task is to distinguish cancer versus normal patterns from mass-spectrometric data. Dexter and farm_ads are text classification problems in a bag-of-words representation. Gisette aims to separate the highly confusable digits ‘4’ and ‘9’. These four datasets are available in the UCI repository. Pcmac and basehock are two subsets extracted from 20newsGroup (Footnote 2): pcmac is to separate documents from “ibm.pc.hardware” and “mac.hardware”, and basehock is to distinguish “baseball” versus “hockey”.

Table 2. A summary of datasets

5.2 Comparison with Online Feature Selection Algorithms

We first compared B-ARDA and B-AMD with OFS [16] and SOFS [20] on the datasets in Table 2. For the OFS, B-ARDA and B-AMD algorithms, the regularization parameter \(\lambda \) and the step-size \(\eta \) were selected from \(\{10^{-1}, 10^{-1.5}, \cdots , 10^{-8}\}\) by taking the values yielding the best performance on the training set. A similar grid was used for selecting the best parameter \(1/ \gamma \) for SOFS. We set \(\delta = 10^{-2}\) for B-ARDA and B-AMD on all datasets. Based on these empirically optimal parameter values, we vary the budget B in order to plot the test accuracy versus the number of selected features.

In order to make our results reliable under the optimal parameter setting, each algorithm was run 10 times, each time with \(\tau \) passes over the training examples: each pass is done with a random permutation of the training set, and the classifier output at the end of the \(\tau \) passes is evaluated on a separate test set. The number of passes \(\tau \) was set to \(\lceil \frac{2d}{n} \rceil \) for each dataset. Figures 1 and 2 display the average test accuracy of all algorithms for varying feature budgets.
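
The evaluation protocol can be summarized by the following sketch (illustrative Python; we assume that n in \(\lceil \frac{2d}{n} \rceil \) denotes the number of training examples, and the names, including the fit_one_pass callback, are ours):

```python
import math
import numpy as np

# Hyper-parameter grid {10^-1, 10^-1.5, ..., 10^-8} used for lambda, eta and 1/gamma.
param_grid = [10.0 ** (-e) for e in np.arange(1.0, 8.5, 0.5)]

def run_protocol(train_X, train_y, fit_one_pass, n_runs=10):
    """Run an online learner for tau = ceil(2d/n) passes, repeated n_runs times,
    each pass over a fresh random permutation of the training set."""
    n, d = train_X.shape
    tau = math.ceil(2 * d / n)
    models = []
    for run in range(n_runs):
        rng = np.random.default_rng(run)
        w = np.zeros(d)
        for _ in range(tau):
            order = rng.permutation(n)
            w = fit_one_pass(w, train_X[order], train_y[order])
        models.append(w)      # each final model is then evaluated on the test set
    return models
```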

Fig. 1. Test performance w.r.t. OFS and SOFS (small feature budgets)

Fig. 2. Test performance w.r.t. OFS and SOFS (large feature budgets)

Based on Fig. 1, we observe that B-ARDA achieves the highest test accuracy for every budget parameter B. By contrast, B-AMD is outperformed by B-ARDA, but remains better than SOFS. Combining Figs. 1 and 2, we observe that the performance gap between B-ARDA and the other algorithms decreases as the budget B increases. The results for B-AMD are mixed: for small values of B, this strategy is outperformed by OFS, due to a large truncation error; but as the budget increases, B-AMD eventually outperforms OFS. For example, on the gisette and farm_ads datasets, B-AMD outperforms OFS for \(B \ge 1000\) and \(B \ge 2000\), respectively. SOFS achieves poor accuracy for small budgets, but its performance approaches that of B-ARDA and B-AMD as B increases. This stems from the fact that SOFS needs to keep more features in order to achieve an accuracy competitive with that of B-ARDA and B-AMD. We can also see that B-ARDA, B-AMD and SOFS all outperform OFS for large values of B. To sum up, when a small number of features is desired, B-ARDA is the best choice, and when more features are allowed, both B-ARDA and B-AMD are better than OFS and SOFS.

Fig. 3. Test performance w.r.t. \(\ell _1\)-ARDA and Ada-Fobos (small feature budgets)

5.3 Comparison with Sparse Online Learning Algorithms

We have also compared our proposed algorithms with \(\ell _1\)-ARDA [4] and Ada-Fobos [4], which achieve feature selection by carefully tuning the \(\ell _1\) regularization parameter. For fair comparisons, the choice of step-sizes follows the experimental setup of Sect. 5.2. Once the step-size value is determined, the \(\ell _1\) regularization parameter is gradually varied to obtain different numbers B of selected features for \(\ell _1\)-ARDA and Ada-Fobos. For B-ARDA and B-AMD, the input budget values B are those obtained by \(\ell _1\)-ARDA and Ada-Fobos, respectively. Figure 3 presents the test accuracy of these algorithms when a small number of features is selected. The curve for Ada-Fobos does not appear in some subfigures since its accuracy falls outside the displayed range.

Based on Fig. 3, we observe that both B-ARDA and \(\ell _1\)-ARDA outperform B-AMD and Ada-Fobos. This indicates that the regularized dual averaging method is more competitive than the mirror descent method, especially when very sparse solutions are desired. Remarkably, B-ARDA is better than \(\ell _1\)-ARDA when a small number of features is required, which shows that our truncation strategy for ARDA is effective. We also notice that Ada-Fobos performs poorly for small budgets; by contrast, B-AMD performs much better. We do not present the plots for larger numbers of features due to space constraints, but we report the observed trend: as the number of features increases, the performance gaps among these algorithms gradually shrink, and all algorithms eventually attain a similar test accuracy. Yet, from a practical viewpoint, it is much simpler to select a desired number of features with B-ARDA and B-AMD. For \(\ell _1\)-ARDA and Ada-Fobos, the number of selected features cannot be fixed in advance: it is determined empirically by the choice of the regularization parameter.

6 Conclusion

In this paper, two novel online feature selection algorithms, called B-ARDA and B-AMD, have been proposed and analyzed. Both algorithms perform feature selection via truncation techniques that take into account the magnitude of feature values in the current predictor, together with the frequency of features in the observed data stream. By taking a desired budget as input, both algorithms are easy to control, especially in comparison with \(\ell _1\)-based feature selection techniques. We have shown on six high-dimensional datasets that B-ARDA outperforms OFS and SOFS, especially when a small number of features is required; when more features are allowed, both B-ARDA and B-AMD are better than OFS and SOFS. Compared with \(\ell _1\)-ARDA and Ada-Fobos, which achieve feature selection by carefully tuning the \(\ell _1\) regularization parameter, B-ARDA is shown to be superior to \(\ell _1\)-ARDA, and B-AMD to Ada-Fobos, which corroborates the interest of our truncation strategies. A natural perspective for future research is to investigate whether our approach can be extended to “structured” feature selection tasks, such as those involving group structures.