
Stream-suitable optimization algorithms for some soft-margin support vector machine variants

  • Hien D. Nguyen
  • Andrew T. Jones
  • Geoffrey J. McLachlan

Abstract

Soft-margin support vector machines (SVMs) are an important class of classification models that are well known to be highly accurate in a variety of settings and over many applications. The training of SVMs usually requires that the data be available all at once, in batch. The Stochastic majorization–minimization (SMM) algorithm framework allows for the training of SVMs on streamed data instead. We utilize the SMM framework to construct algorithms for training hinge loss, squared-hinge loss, and logistic loss SVMs. We prove that our three SMM algorithms are each convergent and demonstrate that the algorithms are comparable to some state-of-the-art SVM-training methods. An application to the famous MNIST data set is used to demonstrate the potential of our algorithms.

Keywords

Big data · MNIST · Stochastic majorization–minimization algorithm · Streamed data · Support vector machines

1 Introduction

The soft-margin support vector machines—which we shall refer to as SVMs from here on—were first proposed in Cortes and Vapnik (1995). Ever since their introduction, the SVMs have become a popular and powerful data analytic tool for conducting classification of labeled data at all scales. Some good texts that treat the various aspects of SVMs are Schölkopf and Smola (2002), Steinwart and Christmann (2008), and Abe (2010).

Let \(\varvec{Z}^{\top }=(\varvec{X}^{\top },Y)\), where \(\varvec{X}\in \mathbb {X}\subset \mathbb {R}{}^{p}\) and \(Y\in \{ -1,1\}\) are random variables, where \(p\in \mathbb {N}\) (zero exclusive) and \((\cdot )^{\top }\) is the transposition operator. Furthermore, suppose that \(\varvec{Z}\) is distributed with some probability law \(\mathcal {L}_{\varvec{Z}}\), which we denote as \(\varvec{Z}\sim \mathcal {L}_{\varvec{Z}}\). Define the classification loss function to be \(l_{\text {C}}(\varvec{z};\varvec{\theta })=\mathbb {I}\{ y\tilde{\varvec{x}}^{\top }\varvec{\theta }<0\}\), where \(\tilde{\varvec{x}}^{\top }=(1,\varvec{x}^{\top })\), \(\varvec{\theta }^{\top }=(\alpha ,\varvec{\beta }^{\top })\in \Theta \subset \mathbb {R}^{p+1}\), and \(\mathbb {I}\{ a\}\) is the indicator function that takes value 1 if statement a is true and 0 otherwise. Under the setup above, we seek to determine a parameter vector \(\hat{\varvec{\theta }}\) that solves the optimization problem:
$$\begin{aligned} \min _{\varvec{\theta }\in \Theta }\mathbb {E}_{\varvec{Z}\sim \mathcal {L}_{\varvec{Z}}}[l_{\text {C}}(\varvec{Z};\varvec{\theta })]. \end{aligned}$$
(1)
The solution \(\hat{\varvec{\theta }}^{\top }=(\hat{\alpha },\hat{\varvec{\beta }}^{\top })\) to Problem (1) yields an optimal separating hyperplane of form \(\hat{\alpha }+\varvec{x}^{\top }\hat{\varvec{\beta }}=0\), which can be used as a predictor for any realized datum \(\varvec{X}=\varvec{x}\) for which Y is unknown. That is, for some observed \(\varvec{x}\), we can make a prediction of Y via the classification rule \(\hat{y}={\text {sign}}(\tilde{\varvec{x}}^{\top }\hat{\varvec{\theta }})\), where \({\text {sign}}(a)=-1\) if \(a<0\), and \({\text {sign}}(a)=1\) otherwise.
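As a small illustration, the classification rule above can be written as the following R sketch; the function name and the vector representation of the inputs are illustrative choices, not part of the article's implementation.

```r
# Minimal sketch of the rule y_hat = sign(x_tilde' theta_hat), with sign(0)
# mapped to +1, as in the convention stated above.
classify <- function(x, theta_hat) {
  score <- sum(c(1, x) * theta_hat)   # x_tilde' theta_hat, with intercept prepended
  if (score < 0) -1L else 1L
}
```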
Of course, Problem (1) is merely theoretical and must be made concrete in the context of real-data analysis. Suppose now that we have a sample of n IID (independent and identically distributed) observations \(\{ \varvec{Z}_{i}\}\), where \(\varvec{Z}_{i}\sim \mathcal {L}_{\varvec{Z}}\) (\(i\in [n]=\{ 1,\dots ,n\}\)). We can then approximate Problem 1 by the average loss optimization problem:
$$\begin{aligned} \min _{\varvec{\theta }\in \Theta }\frac{1}{n}\sum _{i=1}^{n}l_{\text {C}}(\varvec{Z}_{i};\varvec{\theta }), \end{aligned}$$
(2)
which can then be further made concrete upon substitution of some observed instance \(\{ \varvec{z}_{i}\}\) in place of the random variable \(\{ \varvec{Z}_{i}\}\). Let the parameter vector \(\varvec{\theta }^{n}\) solve Problem (2). We can then use \(\varvec{\theta }^{n}\) to construct an optimal separating hyperplane in the same way that we use \(\hat{\varvec{\theta }}\).
Unfortunately, any particular instance of Problem (2) can be highly irregular, combinatorial, and even NP-Hard (cf. Steinwart and Christmann 2008, Ch. 3). As such, a surrogate problem is often substituted in place of Problem (2), whereby the classification loss \(l_{\text {C}}(\varvec{z};\varvec{\theta })\) is replaced by simpler functions with more desirable properties, such as continuity, convexity, or differentiability. The following problem format makes such substitutions; problems of this form will be referred to broadly as SVM risks:
$$\begin{aligned} \min _{\varvec{\theta }\in \Theta }r(\varvec{\theta }), \end{aligned}$$
(3)
where
$$\begin{aligned} r\left( \varvec{\theta }\right) \equiv \frac{1}{n}\sum _{i=1}^{n}l\left( \varvec{Z}_{i};\varvec{\theta }\right) +P\left( \varvec{\theta }\right) , \end{aligned}$$
and where the first term of \(r(\varvec{\theta })\) is the surrogate average loss (where \(l(\varvec{z};\varvec{\theta })\) is the surrogate loss function) and the second term is a penalty that constrains the parameter space of \(\varvec{\beta }\). In this article, we shall consider only the classic ridge penalty of the form \(P(\varvec{\beta })=\lambda \varvec{\beta }^{\top }\varvec{\beta }\), where \(\lambda \ge 0\).

Over the years, there have been many proposals for the choice of a surrogate loss function. The original SVM risk of Cortes and Vapnik (1995) utilized the hinge loss function \(l_{\text {H}}(\varvec{z};\varvec{\theta })=\left[ 1-y\tilde{\varvec{x}}^{\top }\varvec{\theta }\right] _{+}\), where \([a]_{+}=\max \{ 0,a\}\). In our related article, Nguyen and McLachlan (2017), we considered the squared-hinge loss \(l_{\text {S}}(\varvec{z};\varvec{\theta })=\left[ 1-y\tilde{\varvec{x}}^{\top }\varvec{\theta }\right] _{+}^{2}\), as well as the logistic loss \(l_{\text {L}}(\varvec{z};\varvec{\theta })=\log [1+\exp (-y\tilde{\varvec{x}}^{\top }\varvec{\theta })]\). Some other popular surrogate losses are suggested in Zhang (2004).
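For concreteness, the following R sketch transcribes the three surrogate losses directly from the formulas above; the representation of a datum z as a list with components x and y is a convention adopted here for illustration only.

```r
# Direct R transcription of the three surrogate losses, written as functions of
# a datum z = list(x = ..., y = ...) and a parameter vector theta = (alpha, beta).
hinge_loss    <- function(z, theta) max(0, 1 - z$y * sum(c(1, z$x) * theta))
sq_hinge_loss <- function(z, theta) max(0, 1 - z$y * sum(c(1, z$x) * theta))^2
logistic_loss <- function(z, theta) log1p(exp(-z$y * sum(c(1, z$x) * theta)))
```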

In Cortes and Vapnik (1995), it was found that any instance of Problem (3) with a hinge loss surrogate was a quadratic program and thus could be solved via any general quadratic programming solver. The literature now contains an abundance of methods for solving Problem (3) under various settings. A comprehensive review and critique of current solution techniques to Problem (3) appears in Shawe-Taylor and Sun (2011). Of particular relevance to the reader of this article, Navia-Vasquez et al. (2001), Groenen et al. (2008), and Nguyen and McLachlan (2017) all considered iteratively reweighted least-squares (IRLS) approaches for solving different variants of Problem (3) for batch data (static data that are available all at once).

The leading paradigm that defines the Big Data context is the notion of the three V’s (cf. McAfee et al. 2012), where the V’s stand for variety, velocity, and volume. Variety is generally addressed via the modeling of data and model choice, whereas velocity (the fact that data are not static and are accumulated over time) and volume (the fact that data are large in comparison to modern computing resources) require careful consideration of the manner in which models are fitted once they are chosen. In this article, we consider new algorithms for solving Problem (3), for various loss functions, that are designed to address the problems of having potentially high-volume and high-velocity data.

In Nguyen and McLachlan (2017), we showed that it was possible to construct IRLS algorithms for solving various instances of Problem (3) for batch data, using the majorization–minimization (MM) algorithm paradigm of Lange (2016). The constructed algorithms were demonstrated to be convergent, in the sense that the sequence of iterates approaches the global minimizer of the respective problem instance as the number of iterations of each algorithm approaches infinity.

In this article, we utilize the recently developed stochastic MM (SMM) algorithm of Razaviyayn et al. (2016) to solve the stochastic version of Problem (3):
$$\begin{aligned} \min _{\varvec{\theta }\in \Theta }R\left( \varvec{\theta }\right) \equiv \mathbb {E}_{\varvec{Z}\sim \mathcal {L}_{\varvec{Z}}}\left[ l\left( \varvec{Z};\varvec{\theta }\right) +P\left( \varvec{\theta }\right) \right] , \end{aligned}$$
(4)
upon observation of some infinite stream of data \(\left\{ \varvec{z}_{i}\right\}\), where each \(\varvec{z}_{i}\) is a realization of \(\varvec{Z}_{i}\sim \mathcal {L}_{\varvec{Z}}\). The particular instances of the problem that we solve are for the same loss functions as those that were addressed in Nguyen and McLachlan (2017). The resulting algorithms have weighted least-squares (WLS) or least-squares (LS) like forms as each datum \(\varvec{z}_{i}\) is introduced into the available data set from which model fitting can be conducted. At its introduction, a single weight for datum \(\varvec{z}_{i}\) is computed, and it is never again reweighted for the entirety of the optimization process. This is in contrast with the IRLS algorithms of Nguyen and McLachlan (2017), which require the iterative reweighting of each observation \(\varvec{z}_{i}\) at each iteration of the algorithm, until its convergence. Thus, we can observe directly how the new algorithms address the problems of high-velocity and high-volume data: they allow for data to arrive in a stream and they do not perform reweighting operations that are unnecessary to the performance of the algorithm.

In addition to the qualitative advantages that we have described above, the newly constructed algorithms can also be proved to be globally convergent. That is, as the number of observations n in the data stream \(\left\{ \varvec{z}_{i}\right\}\) increases, each of the algorithms produces a solution \(\varvec{\theta }^{n}\) to Problem (4) that approaches the global minimizer of the problem with probability one. This is a useful guarantee that corresponds contextually to the convergence results that were obtained in Nguyen and McLachlan (2017).

We complement our theoretical results with some numerical simulations that display the typical performance of the constructed algorithms in various settings. As a demonstration, we also apply our algorithms to a classification problem involving the classic MNIST data of LeCun (1998).

The rest of the article proceeds as follows. A description of the SMM optimization framework proposed in this article is provided in “Stochastic MM algorithm”. In the third section, we derive the SMM algorithms for the addressed SVMs. In “Convergence analysis”, theoretical results are presented regarding the convergence of each of the algorithms. Numerical simulations are then provided in the fifth section. In the sixth section, we demonstrate the algorithms via applications to the MNIST data classification problem. Conclusions are then drawn in the final section.

2 Stochastic MM algorithm

The SMM algorithm that we present here is the one described by Razaviyayn et al. (2016). An alternative approach to the construction of stochastic MM-type optimization schemes was also considered by Mairal (2013). However, the approach of Mairal (2013) results in a somewhat more complicated set of iterations and convergence conditions than those of Razaviyayn et al. (2016). The SMM algorithm that is discussed is further connected to the stochastic expectation–maximization (EM) algorithm of Titterington (1984) and the online EM algorithm of Cappé and Moulines (2009). Each of these four mentioned frameworks is suitable for various different settings and tasks.

We now consider optimization problems of the form:
$$\begin{aligned} \min _{\varvec{\gamma }\in \Gamma }f\left( \varvec{\gamma }\right) \equiv \mathbb {E}_{\varvec{W}\sim \mathcal {L}_{\varvec{W}}}\left[ g_{1}\left( \varvec{\gamma };\varvec{W}\right) +g_{2}\left( \varvec{\gamma };\varvec{W}\right) \right] , \end{aligned}$$
(5)
where \(\Gamma\) is a bounded and closed subset of \(\mathbb {R}^{d}\) and \(\varvec{W}\in \mathbb {W}\subset \mathbb {R}^{q}\) (\(d,q\in \mathbb {N}\)) is a random variable with law \(\mathcal {L}_{\varvec{W}}\). Furthermore, let \(g_{1}\left( \varvec{\gamma };\varvec{w}\right)\) be twice continuously differentiable but possibly not convex in \(\varvec{\gamma }\) on a bounded and open set \(\tilde{\Gamma }\) containing \(\Gamma\), and let \(g_{2}\left( \varvec{\gamma };\varvec{w}\right)\) be convex but possibly not smooth in \(\varvec{\gamma }\).
In the optimization literature, a popular approach to solving Problem (5), upon obtaining some stream of data \(\left\{ \varvec{W}_{i}\right\}\), is to utilize the so-called sample average approximation (SAA) approach; see Kim et al. (2015) for an introduction on the subject. The SAA approach to solving Problem (5) using data equates to solving the sample problem:
$$\begin{aligned} \min _{\varvec{\gamma }\in \Gamma }\tilde{f}_{n}\left( \varvec{\gamma }\right) \equiv \frac{1}{n}\sum _{i=1}^{n}\left[ g_{1}\left( \varvec{\gamma };\varvec{W}_{i}\right) +g_{2}\left( \varvec{\gamma };\varvec{W}_{i}\right) \right] \end{aligned}$$
(6)
upon the attainment of the \(n{\text {th}}\) observation from the data stream.

Due to the potential lack of convexity of the function \(g_{1}\left( \varvec{\gamma };\varvec{w}\right)\) or lack of differentiability of the function \(g_{2}\left( \varvec{\gamma };\varvec{w}\right)\), the standard SAA approach may require computationally intensive, and possibly iterative, procedures to compute a new solution \(\varvec{\gamma }^{n}\) upon the introduction of the \(n{\text {th}}\) observation. This results in an algorithmic scheme that requires iterations within iterations and is thus not suitable for the Big Data setting.

Suppose that we can find some function \(\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right)\) with the following properties: (A1) \(\tilde{g}_{1}\left( \varvec{\delta },\varvec{\delta };\varvec{w}\right) =g_{1}\left( \varvec{\delta };\varvec{w}\right)\) for all \(\varvec{\delta }\in \Gamma\) and \(\varvec{w}\in \mathbb {W}\), and (A2) \(\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) \ge g_{1}\left( \varvec{\gamma };\varvec{w}\right)\) for all \(\varvec{\gamma }\in \tilde{\Gamma }\), \(\varvec{\delta }\in \Gamma\), and \(\varvec{w}\in \mathbb {W}\). We call any function that satisfies (A1) and (A2) a majorizer of \(g_{1}\left( \varvec{\gamma };\varvec{w}\right)\). Here, the majorizer \(\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right)\) is assumed to be simpler to minimize than the function that it majorizes.

Given a stream of data \(\left\{ \varvec{W}_{i}\right\}\), and given the availability of a simplifying majorizer, the SMM algorithm proposes to replace the difficult problem (6) with the simpler problem:
$$\begin{aligned} \min _{\varvec{\gamma }\in \Gamma }\frac{1}{n}\sum _{i=1}^{n}\left[ \tilde{g}_{1}\left( \varvec{\gamma },\varvec{\gamma }^{i-1};\varvec{W}_{i}\right) +g_{2}\left( \varvec{\gamma };\varvec{W}_{i}\right) \right] , \end{aligned}$$
(7)
on obtaining the newest observation \(\varvec{W}_{n}\) (\(n\in \mathbb {N}\)). Here, \(\varvec{\gamma }^{i}\) (\(i<n\)) is a minimizer for Problem (7) after the introduction of the \(i{\text {th}}\) observation to the data stream. For each new observation that is introduced, we see that the majorizing surrogate function approximates the minimization problem around the previously optimal iterate that was obtained. A pseudocode of a concrete instance of the SMM framework is provided in Algorithm 1.
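The following R sketch outlines the generic SMM iteration just described. The helper argmin_surrogate, assumed to return the minimizer of the averaged surrogate in (7), is hypothetical and stands in for the loss-specific closed-form updates derived in the next section.

```r
# Sketch of the generic SMM loop: each new observation is paired with the
# previous iterate, which anchors its majorizer, and the next iterate minimizes
# the running average of all surrogate terms accumulated so far.
smm <- function(stream, gamma0, argmin_surrogate) {
  gamma   <- gamma0
  anchors <- list()
  for (i in seq_along(stream)) {
    anchors[[i]] <- list(w = stream[[i]], anchor = gamma)  # (W_i, gamma^{i-1})
    gamma <- argmin_surrogate(anchors)                     # gamma^{i} solving (7)
  }
  gamma
}
```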

3 SMM algorithms for SVMs

To construct the SMM algorithms for the hinge loss, squared-hinge loss, and logistic loss SVM optimization problems, we require the following majorizers. The derivations of these facts appear in Nguyen and McLachlan (2017) and Böhning and Lindsay (1988), and are thus omitted for brevity. Throughout this section, we use \(\varvec{\gamma }\) and \(\varvec{\theta }\) interchangeably, as well as \(\varvec{w}\) and \(\varvec{z}\).

Fact 1

For any \(\epsilon >0\), the function \(g\left( \varvec{\gamma };\varvec{w}\right) =\frac{1}{2}\sqrt{\left[ h\left( \varvec{\gamma };\varvec{w}\right) \right] ^{2}+\epsilon }+\frac{1}{2}h\left( \varvec{\gamma };\varvec{w}\right)\), for any real-valued function \(h\left( \varvec{\gamma };\varvec{w}\right)\), can be majorized by
$$\begin{aligned} \tilde{g}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right)&=\frac{1}{4}\frac{\left[ h\left( \varvec{\gamma };\varvec{w}\right) +\sqrt{\left[ h\left( \varvec{\delta };\varvec{w}\right) \right] ^{2}+\epsilon }\right] ^{2}+\epsilon }{\sqrt{\left[ h\left( \varvec{\delta };\varvec{w}\right) \right] ^{2}+\epsilon }}, \end{aligned}$$
at any valid inputs \(\varvec{\delta }\) and \(\varvec{w}\).

Fact 2

Let \(g\left( \varvec{\gamma };\varvec{w}\right)\) be a real-valued function that is twice-differentiable in \(\varvec{\gamma }\) for each valid input \(\varvec{w}\). Let \(\mathbf {H}\) be a constant matrix such that \(\mathbf {H}-\partial ^{2}g\left( \varvec{\gamma };\varvec{w}\right) /\partial \varvec{\gamma }\partial \varvec{\gamma }^{\top }\) is positive semidefinite, for all \(\varvec{\gamma }\in \Gamma\) and fixed \(\varvec{w}\). Then, \(g\left( \varvec{\gamma };\varvec{w}\right)\) can be majorized by
$$\begin{aligned} \tilde{g}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) =g\left( \varvec{\delta };\varvec{w}\right) +\left. \frac{\partial g\left( \varvec{u};\varvec{w}\right) }{\partial \varvec{u}}\right| _{\varvec{u}=\varvec{\delta }}\left( \varvec{\gamma }-\varvec{\delta }\right) +\frac{1}{2}\left( \varvec{\gamma }-\varvec{\delta }\right) ^{\top }\mathbf {H}\left( \varvec{\gamma }-\varvec{\delta }\right) , \end{aligned}$$
at any valid inputs \(\varvec{\delta }\) and \(\varvec{w}\).

3.1 Hinge loss SMM

In the hinge loss case, the \(n{\text {th}}\) step sub-problem for solving Problem (4) is obtained by making the substitutions \(g_{1}\left( \varvec{\theta };\varvec{z}\right) =l_{\text {H}}\left( \varvec{\theta };\varvec{z}\right)\) and \(g_{2}\left( \varvec{\theta };\varvec{z}\right) =\lambda \varvec{\beta }^{\top }\varvec{\beta }\). Note that \(g_{1}\left( \varvec{\theta };\varvec{z}\right)\) does not meet the twice continuously differentiable assumption that is required for the construction of an SMM. Using a small \(\epsilon >0\), we thus approximate \(g_{1}\left( \varvec{\theta };\varvec{z}\right)\) by the twice continuously differentiable function:
$$\begin{aligned} g_{1}^{*}\left( \varvec{\theta };\varvec{z}\right) =\frac{1}{2}\sqrt{\left( 1-y\tilde{\varvec{x}}^{\top }\varvec{\theta }\right) ^{2}+\epsilon }+\frac{1}{2}\left( 1-y\tilde{\varvec{x}}^{\top }\varvec{\theta }\right) , \end{aligned}$$
via the identity \(\max \left\{ \gamma ,\delta \right\} =\left| \gamma -\delta \right| /2+\gamma /2+\delta /2\) for \(\gamma ,\delta \in \mathbb {R}\).
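As a sketch, the smoothed loss \(g_{1}^{*}\) can be written in R as follows, with the default value of \(\epsilon\) chosen only for illustration.

```r
# Sketch of the smoothed hinge loss: max{0, a} = |a|/2 + a/2 with |a| replaced
# by sqrt(a^2 + eps), where a = 1 - y * x_tilde' theta.
smooth_hinge <- function(z, theta, eps = 1e-5) {
  a <- 1 - z$y * sum(c(1, z$x) * theta)
  0.5 * sqrt(a^2 + eps) + 0.5 * a
}
```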
Using Fact 1, we can majorize \(g_{1}^{*}\left( \varvec{\theta };\varvec{z}\right)\) at \(\varvec{\delta }\in \Theta\) for fixed \(\varvec{z}\), by
$$\begin{aligned} \tilde{g}_{1}^{*}\left( \varvec{\theta },\varvec{\delta };\varvec{z}\right) =\frac{1}{4}\frac{\left[ 1-\tilde{\varvec{y}}^{\top }\varvec{\theta }+\omega \left( \varvec{\delta };\varvec{z}\right) \right] ^{2}+\epsilon }{\omega \left( \varvec{\delta };\varvec{z}\right) }, \end{aligned}$$
(8)
where \(\tilde{\varvec{y}}=y\tilde{\varvec{x}}\) and \(\omega \left( \varvec{\theta };\varvec{z}\right) =\sqrt{\left( 1-\tilde{\varvec{y}}^{\top }\varvec{\theta }\right) ^{2}+\epsilon }\). Using majorizer (8), the sequence of minimizers up to the \(\left( n-1\right) \text {th}\) iteration \(\left\{ \varvec{\theta }^{i}\right\}\), and a realized sequence up to the \(n\text {th}\) observation \(\left\{ \varvec{z}_{i}\right\}\), we can write the \(n\text {th}\) iteration sub-problem of the SMM algorithm as
$$\begin{aligned} \underset{\varvec{\theta }\in \Theta }{\min }f_{\text {H}}\left( \varvec{\theta }\right) \equiv \frac{1}{n}\sum _{i=1}^{n}\left[ \tilde{g}_{1}^{*}\left( \varvec{\theta },\varvec{\theta }^{i-1};\varvec{z}_{i}\right) +g_{2}\left( \varvec{\theta };\varvec{z}_{i}\right) \right] \end{aligned}$$
(9)
where we can write
$$\begin{aligned} f_{\text {H}}\left( \varvec{\theta }\right)&= \frac{1}{4n}\sum _{i=1}^{n}\frac{\left[ 1-\tilde{\varvec{y}}_{i}^{\top }\varvec{\theta }+\omega \left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right) \right] ^{2}+\epsilon }{\omega \left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right) }+\frac{\lambda }{n}\sum _{i=1}^{n}\varvec{\beta }^{\top }\varvec{\beta }\\&= \frac{1}{4n}\left( \mathbf {1}_{n}+\varvec{\omega }_{n}-\tilde{\mathbf {Y}}_{n}\varvec{\theta }\right) ^{\top }\varvec{\Omega }_{n}^{-1}\left( \mathbf {1}_{n}+\varvec{\omega }_{n}-\tilde{\mathbf {Y}}_{n}\varvec{\theta }\right) \\&\quad +\lambda \varvec{\theta }^{\top }\tilde{\mathbf {I}}_{p}\varvec{\theta }+\frac{\epsilon }{4n}{\text {tr}}\left( \varvec{\Omega }_{n}^{-1}\right) , \end{aligned}$$
via the substitutions \(\varvec{\omega }_{n}^{\top }=\left( \omega \left( \varvec{\theta }^{0};\varvec{z}_{1}\right) ,\dots ,\omega \left( \varvec{\theta }^{n-1};\varvec{z}_{n}\right) \right)\) and \(\varvec{\Omega }_{n}={\text {diag}}\left( \varvec{\omega }_{n}\right)\), where \(\tilde{\mathbf {Y}}_{n}\in \mathbb {R}^{n\times (p+1)}\) is the matrix with \(i\text {th}\) row \(\tilde{\varvec{y}}_{i}^{\top }\). Here, \(\varvec{1}_{n}\in \mathbb {R}^{n}\) is a vector of ones and
$$\begin{aligned} \tilde{\mathbf {I}}_{p}=\left[ \begin{array}{cc} 0 &{} 0\\ 0 &{} \mathbf {I}_{p} \end{array}\right] , \end{aligned}$$
where \(\mathbf {I}_{p}\in \mathbb {R}^{p\times p}\) is the identity matrix, \({\text {diag}}\left( \cdot \right)\) converts a vector into a square matrix with the vector values along the leading diagonal and zeros elsewhere, and \({\text {tr}}\left( \cdot \right)\) is the matrix trace operator.
We observe that Problem (9) is a quadratic minimization problem by the definition of \(\varvec{\Omega }_{n}\), and thus, we can solve it using the standard first-order condition (FOC) of multivariate calculus. The gradient of \(f_{\text {H}}\left( \varvec{\theta }\right)\) is
$$\begin{aligned} \frac{\partial f_{\text {H}}\left( \varvec{\theta }\right) }{\partial \varvec{\theta }}=-\frac{1}{2n}\tilde{\mathbf {Y}}_{n}^{\top }\varvec{\Omega }_{n}^{-1}\left( \mathbf {1}_{n}+\varvec{\omega }_{n}-\tilde{\mathbf {Y}}_{n}\varvec{\theta }\right) +2\lambda \tilde{\mathbf {I}}_{p}\varvec{\theta }. \end{aligned}$$
Upon setting \(\partial f_{\text {H}}\left( \varvec{\theta }\right) /\partial \varvec{\theta }=\mathbf {0}_{p}\), where \(\mathbf {0}_{p}\in \mathbb {R}^{p}\) is a vector of zeros, we obtain the root and FOC solution
$$\begin{aligned} \varvec{\theta }_{\text {H}}^{n}=\left( \tilde{\mathbf {Y}}_{n}^{\top }\varvec{\Omega }_{n}^{-1}\tilde{\mathbf {Y}}_{n}+4\lambda n\tilde{\mathbf {I}}_{p}\right) ^{-1}\tilde{\mathbf {Y}}_{n}^{\top }\varvec{\Omega }_{n}^{-1}\left( \mathbf {1}_{n}+\varvec{\omega }_{n}\right) . \end{aligned}$$
(10)
Thus, the \(n\text {th}\) iteration update in Algorithm 1 for the approximate SVM problem with hinge loss is to set \(\varvec{\theta }^{n}\leftarrow \varvec{\theta }_{\text {H}}^{n}\). We observe that the derived algorithm has a WLS form as each observation \(\varvec{z}_{i}\) is weighted (once) by the value \(\omega \left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right)\), within the vector \(\varvec{\omega }_{n}\). This results in a lesser computational burden than the iterative reweighting of each observation in the IRLS algorithm of Nguyen and McLachlan (2017) for the same SVM variant.
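A minimal R sketch of the update (10) is given below, under the assumption that the data \(\varvec{z}_{1},\dots ,\varvec{z}_{n}\) and the anchors \(\varvec{\theta }^{0},\dots ,\varvec{\theta }^{n-1}\) are held in lists; a single-pass version that avoids storing the stream is sketched in “Computational remarks”.

```r
# Sketch of the hinge-loss update (10): each z_i carries the weight
# omega(theta^{i-1}; z_i), computed once, and theta_H^n solves a WLS system.
hinge_update <- function(Z, anchors, lambda, eps = 1e-5) {
  n  <- length(Z)
  p1 <- length(anchors[[1]])                                    # p + 1
  Yt <- t(vapply(Z, function(z) z$y * c(1, z$x), numeric(p1)))  # rows y_tilde_i'
  om <- vapply(seq_len(n), function(i)
    sqrt((1 - sum(Yt[i, ] * anchors[[i]]))^2 + eps), numeric(1))
  I_tilde <- diag(c(0, rep(1, p1 - 1)))                         # no penalty on the intercept
  A <- t(Yt) %*% (Yt / om) + 4 * lambda * n * I_tilde           # Y' Omega^{-1} Y + 4*lambda*n*I_tilde
  b <- t(Yt) %*% ((1 + om) / om)                                # Y' Omega^{-1} (1 + omega)
  drop(solve(A, b))                                             # theta_H^n
}
```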

3.2 Squared-hinge loss SMM

The squared-hinge loss case \(n\text {th}\) step sub-problem for solving Problem (4) is obtained by making the substitutions \(g_{1}\left( \varvec{\theta };\varvec{z}\right) =l_{\text {S}}\left( \varvec{\theta };\varvec{z}\right)\) and \(g_{2}\left( \varvec{\theta };\varvec{z}\right) =\lambda \varvec{\beta }^{\top }\varvec{\beta }\). Although \(l_{\text {S}}\left( \varvec{\theta };\varvec{z}\right)\) is differentiable in \(\varvec{\theta }\) for any \(\varvec{z}\), it still does not meet the twice continuously differentiable criterion required by the SMM algorithm framework. We must thus devise an approximation for the squared-hinge loss that is suitable for our purpose.

Using the same identity as that which was used in “Hinge loss SMM”, we can write \(\left[ \gamma \right] _{+}^{2}=\left( \left| \gamma \right| /2+\gamma /2\right) ^{2}\), which can then be expanded to \(\left[ \gamma \right] _{+}^{2}=\gamma ^{2}/2+\gamma \left| \gamma \right| /2\). In this form, it is easy to see that for any small \(\epsilon >0\), we can approximate \(g\left( \gamma \right) =\left[ \gamma \right] _{+}^{2}\) by \(g_{\epsilon }\left( \gamma \right) =\left( \gamma ^{2}+\epsilon \right) /2+\gamma \sqrt{\gamma ^{2}+\epsilon }/2\).

Note the desirable property that \(g_{\epsilon }\left( \gamma \right) >0\) for any choice of \(\epsilon >0\) and for all \(\gamma \in \mathbb {R}\), since \(\lim _{\gamma \rightarrow -\infty }g_{\epsilon }\left( \gamma \right) =\epsilon /4\), \(\lim _{\gamma \rightarrow \infty }g_{\epsilon }\left( \gamma \right) =\infty\), and \({\text {d}}g_{\epsilon }\left( \gamma \right) /{\text {d}}\gamma =\left( \sqrt{\gamma ^{2}+\epsilon }+\gamma \right) ^{2}/\left( 2\sqrt{\gamma ^{2}+\epsilon }\right)\) is always positive.

Next, consider that the second and third derivatives can be written as \({\text {d}}^{2}g_{\epsilon }(\gamma )/{\text {d}}\gamma ^{2}=1+3\gamma /(2\sqrt{\gamma ^{2}+\epsilon })-\gamma ^{3}/[2(\gamma ^{2}+\epsilon )^{3/2}]\), and \({\text {d}}^{3}g_{\epsilon }(\gamma )/{\text {d}}\gamma ^{3}=3\epsilon ^{2}/[2(\gamma ^{2}+\epsilon )^{5/2}]\). Now, notice that the third derivative is positive, and hence, the second derivative is strictly increasing. Furthermore, the second derivative has limits 0 and 2 as \(\gamma\) approaches \(-\infty\) and \(\infty\), respectively, for any \(\epsilon >0\). Thus, this implies that \(g_{\epsilon }(\gamma )\) is convex and that its second derivative can be upper bounded by 2, for any \(\epsilon\). Using Fact 2, we can then majorize \(g_{\epsilon }(\gamma )\) by
$$\begin{aligned} \tilde{g}_{\epsilon }\left( \gamma ,\delta \right) =g_{\epsilon }\left( \delta \right) +\frac{\left( \sqrt{\delta ^{2}+\epsilon }+\delta \right) ^{2}}{2\sqrt{\delta ^{2}+\epsilon }}\left( \gamma -\delta \right) +\left( \gamma -\delta \right) ^{2}, \end{aligned}$$
(11)
at any valid value of \(\delta\).
Using \(g_{\epsilon }(\gamma )\), we can approximate the squared-hinge loss function \(l_{\text {S}}\left( \varvec{\theta };\varvec{z}\right)\) by
$$\begin{aligned} g_{1}^{*}\left( \varvec{\theta };\varvec{z}\right) =\frac{\left( 1-\tilde{\varvec{y}}^{\top }\varvec{\theta }\right) ^{2}+\epsilon }{2}+\frac{\left( 1-\tilde{\varvec{y}}^{\top }\varvec{\theta }\right) \sqrt{\left( 1-\tilde{\varvec{y}}^{\top }\varvec{\theta }\right) ^{2}+\epsilon }}{2}, \end{aligned}$$
for any \(\epsilon >0\). We can then use (11) to produce the majorizer:
$$\begin{aligned} \tilde{g}_{1}^{*}\left( \varvec{\theta },\varvec{\delta };\varvec{z}\right)= & {} g_{1}^{*}\left( \varvec{\delta };\varvec{z}\right) -\frac{\left( \sqrt{\left( 1-\tilde{\varvec{y}}^{\top }\varvec{\delta }\right) ^{2}+\epsilon }+1-\tilde{\varvec{y}}^{\top }\varvec{\delta }\right) ^{2}}{2\sqrt{\left( 1-\tilde{\varvec{y}}^{\top }\varvec{\delta }\right) ^{2}+\epsilon }}\tilde{\varvec{y}}^{\top }\left( \varvec{\theta }-\varvec{\delta }\right) \nonumber \\&+\left( \varvec{\theta }-\varvec{\delta }\right) ^{\top }\tilde{\varvec{y}}\tilde{\varvec{y}}^{\top }\left( \varvec{\theta }-\varvec{\delta }\right) , \end{aligned}$$
(12)
at any valid values \(\varvec{\delta }\) and \(\varvec{z}\).
Using majorizer (12), the sequence of minimizers up to the \(\left( n-1\right) \text {th}\) iteration \(\left\{ \varvec{\theta }^{i}\right\}\), and a realized sequence up to the \(n\text {th}\) observation \(\left\{ \varvec{z}_{i}\right\}\), we can write the \(n\text {th}\) iteration sub-problem of the SMM algorithm as
$$\begin{aligned} \underset{\varvec{\theta }\in \Theta }{\min }f_{\text {S}}\left( \varvec{\theta }\right) \equiv \frac{1}{n}\sum _{i=1}^{n}\left[ \tilde{g}_{1}^{*}\left( \varvec{\theta },\varvec{\theta }^{i-1};\varvec{z}_{i}\right) +g_{2}\left( \varvec{\theta };\varvec{z}_{i}\right) \right] , \end{aligned}$$
(13)
where we can write
$$\begin{aligned} f_{\text {S}}\left( \varvec{\theta }\right)&= \frac{1}{n}\sum _{i=1}^{n}\left[ g_{1}^{*}\left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right) -\psi \left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right) \tilde{\varvec{y}}_{i}^{\top }\left( \varvec{\theta }-\varvec{\theta }^{i-1}\right) \right. \\&\quad +\left. \left( \varvec{\theta }-\varvec{\theta }^{i-1}\right) ^{\top }\tilde{\varvec{y}}_{i}\tilde{\varvec{y}}_{i}^{\top }\left( \varvec{\theta }-\varvec{\theta }^{i-1}\right) \right] +\frac{\lambda }{n}\sum _{i=1}^{n}\varvec{\beta }^{\top }\varvec{\beta }\\&= -\frac{1}{n}\varvec{\psi }_{n}^{\top }\tilde{\mathbf {Y}}_{n}\varvec{\theta }+\frac{1}{n}\sum _{i=1}^{n}\left( \varvec{\theta }-\varvec{\theta }^{i-1}\right) ^{\top }\tilde{\varvec{y}}_{i}\tilde{\varvec{y}}_{i}^{\top }\left( \varvec{\theta }-\varvec{\theta }^{i-1}\right) \\&\quad +\lambda \varvec{\theta }^{\top }\tilde{\mathbf {I}}_{p}\varvec{\theta }+\frac{1}{n}\sum _{i=1}^{n}g_{1}^{*}\left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right) +\frac{1}{n}\sum _{i=1}^{n}\psi \left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right) \tilde{\varvec{y}}_{i}^{\top }\varvec{\theta }^{i-1}, \end{aligned}$$
and where
$$\begin{aligned} \psi \left( \varvec{\theta };\varvec{z}\right) =\frac{\left( \sqrt{\left( 1-\tilde{\varvec{y}}^{\top }\varvec{\theta }\right) ^{2}+\epsilon }+1-\tilde{\varvec{y}}^{\top }\varvec{\theta }\right) ^{2}}{2\sqrt{\left( 1-\tilde{\varvec{y}}^{\top }\varvec{\theta }\right) ^{2}+\epsilon }}, \end{aligned}$$
and \(\varvec{\psi }_{n}^{\top }=\left( \psi \left( \varvec{\theta }^{0};\varvec{z}_{1}\right) ,\dots ,\psi \left( \varvec{\theta }^{n-1};\varvec{z}_{n}\right) \right)\). Since \(f_{\text {S}}\left( \varvec{\theta }\right)\) is again a quadratic, we solve for the FOC \(\partial f_{\text {S}}\left( \varvec{\theta }\right) /\partial \varvec{\theta }=\mathbf {0}_{p}\) to obtain the minimum
$$\begin{aligned} \varvec{\theta }_{\text {S}}^{n}=\left( \tilde{\mathbf {Y}}_{n}^{\top }\tilde{\mathbf {Y}}_{n}+\lambda n\tilde{\mathbf {I}}_{p}\right) ^{-1}\left( \sum _{i=1}^{n}\tilde{\varvec{y}}_{i}\tilde{\varvec{y}}_{i}^{\top }\varvec{\theta }^{i-1}+\frac{1}{2}\tilde{\mathbf {Y}}_{n}^{\top }\varvec{\psi }_{n}\right) . \end{aligned}$$
(14)
Thus, the \(n\text {th}\) iteration update in Algorithm 1 for the approximate SVM problem with squared-hinge loss is to set \(\varvec{\theta }^{n}\leftarrow \varvec{\theta }_{\text {S}}^{n}\). Note that the iterations are in an LS-like form, whereby we compute \(\psi \left( \varvec{\theta }^{n-1};\varvec{z}_{n}\right)\) only once, for each newly introduced datum \(\varvec{z}_{n}\) in the stream \(\left\{ \varvec{z}_{i}\right\}\).
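Assuming the running accumulators that are discussed in “Computational remarks” below, the update (14) reduces to the following R sketch; the argument names are illustrative.

```r
# Sketch of update (14), given the running sums
#   S = sum_i y_tilde_i y_tilde_i',  u = sum_i y_tilde_i y_tilde_i' theta^{i-1},
#   v = sum_i psi(theta^{i-1}; z_i) y_tilde_i,
# each of which is updated once per new datum.
sq_hinge_update <- function(S, u, v, n, lambda) {
  p1 <- nrow(S)
  I_tilde <- diag(c(0, rep(1, p1 - 1)))
  drop(solve(S + lambda * n * I_tilde, u + 0.5 * v))   # theta_S^n
}
```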

3.3 Logistic loss SMM

The logistic loss case \(n\text {th}\) step sub-problem for solving Problem (4) is obtained by making the substitutions \(g_{1}\left( \varvec{\theta };\varvec{z}\right) =l_{\text {L}}\left( \varvec{\theta };\varvec{z}\right)\) and \(g_{2}\left( \varvec{\theta };\varvec{z}\right) =\lambda \varvec{\beta }^{\top }\varvec{\beta }\). Fortunately, unlike the hinge and the squared-hinge loss, the logistic loss \(l_{\text {L}}\left( \varvec{\theta };\varvec{z}\right)\) is twice continuously differentiable and does not require approximation. We now seek to obtain a majorizer for \(g_{1}\left( \varvec{\theta };\varvec{z}\right)\).

The partial derivatives and Hessian of \(g_{1}\left( \varvec{\theta };\varvec{z}\right) =\log \left[ 1+\exp \left( -\tilde{\varvec{y}}^{\top }\varvec{\theta }\right) \right]\) can be written as
$$\begin{aligned} \frac{\partial g_{1}\left( \varvec{\theta };\varvec{z}\right) }{\partial \varvec{\theta }}=-\chi \left( \varvec{\theta };\varvec{z}\right) \tilde{\varvec{y}} \end{aligned}$$
and
$$\begin{aligned} \frac{\partial ^{2}g_{1}\left( \varvec{\theta };\varvec{z}\right) }{\partial \varvec{\theta }\partial \varvec{\theta }^{\top }}=\chi \left( \varvec{\theta };\varvec{z}\right) \left[ 1-\chi \left( \varvec{\theta };\varvec{z}\right) \right] \tilde{\varvec{y}}\tilde{\varvec{y}}^{\top }, \end{aligned}$$
where \(\chi \left( \varvec{\theta };\varvec{z}\right) =\exp \left( -\tilde{\varvec{y}}^{\top }\varvec{\theta }\right) /\left[ 1+\exp \left( -\tilde{\varvec{y}}^{\top }\varvec{\theta }\right) \right]\). Note that \(\chi \left( \varvec{\theta };\varvec{z}\right) \left[ 1-\chi \left( \varvec{\theta };\varvec{z}\right) \right] \le 1/4\) since \(0<\chi \left( \varvec{\theta };\varvec{z}\right) <1\). Thus, \(\mathbf {H}-\partial ^{2}g_{1}\left( \varvec{\theta };\varvec{z}\right) /\partial \varvec{\theta }\partial \varvec{\theta }^{\top }\) is positive semidefinite, where \(\mathbf {H}=\tilde{\varvec{y}}\tilde{\varvec{y}}^{\top }/4\). Upon application of Fact 2, we have the majorizer:
$$\begin{aligned} \tilde{g}_{1}\left( \varvec{\theta },\varvec{\delta };\varvec{z}\right)= & {} g_{1}\left( \varvec{\delta };\varvec{z}\right) -\chi \left( \varvec{\delta };\varvec{z}\right) \tilde{\varvec{y}}^{\top }\left( \varvec{\theta }-\varvec{\delta }\right) \nonumber \\+ & {} \frac{1}{8}\left( \varvec{\theta }-\varvec{\delta }\right) ^{\top }\tilde{\varvec{y}}\tilde{\varvec{y}}^{\top }\left( \varvec{\theta }-\varvec{\delta }\right) , \end{aligned}$$
(15)
for \(g_{1}\left( \varvec{\theta };\varvec{z}\right)\), at any valid \(\varvec{\delta }\) and \(\varvec{z}\).
Using majorizer (15), the sequence of minimizers up to the \(\left( n-1\right) \text {th}\) iteration \(\left\{ \varvec{\theta }^{i}\right\}\), and a realized sequence up to the \(n\text {th}\) observation \(\left\{ \varvec{z}_{i}\right\}\), we can write the \(n\text {th}\) iteration sub-problem of the SMM algorithm as
$$\begin{aligned} \underset{\varvec{\theta }\in \Theta }{\min }f_{\text {L}}\left( \varvec{\theta }\right) \equiv \frac{1}{n}\sum _{i=1}^{n}\left[ \tilde{g}_{1}\left( \varvec{\theta },\varvec{\theta }^{i-1};\varvec{z}_{i}\right) +g_{2}\left( \varvec{\theta };\varvec{z}_{i}\right) \right] , \end{aligned}$$
(16)
where we can write
$$\begin{aligned} f_{\text {L}}\left( \varvec{\theta }\right)&= \frac{1}{n}\sum _{i=1}^{n}\left[ g_{1}\left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right) -\chi \left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right) \tilde{\varvec{y}}_{i}^{\top }\left( \varvec{\theta }-\varvec{\theta }^{i-1}\right) \right. \\&\quad + \left. \frac{1}{8}\left( \varvec{\theta }-\varvec{\theta }^{i-1}\right) ^{\top }\tilde{\varvec{y}}_{i}\tilde{\varvec{y}}_{i}^{\top }\left( \varvec{\theta }-\varvec{\theta }^{i-1}\right) \right] +\frac{\lambda }{n}\sum _{i=1}^{n}\varvec{\beta }^{\top }\varvec{\beta }\\&= -\frac{1}{n}\varvec{\chi }_{n}^{\top }\tilde{\mathbf {Y}}_{n}\varvec{\theta }+\frac{1}{8n}\sum _{i=1}^{n}\left( \varvec{\theta }-\varvec{\theta }^{i-1}\right) ^{\top }\tilde{\varvec{y}}_{i}\tilde{\varvec{y}}_{i}^{\top }\left( \varvec{\theta }-\varvec{\theta }^{i-1}\right) \\&\quad +\lambda \varvec{\theta }^{\top }\tilde{\mathbf {I}}_{p}\varvec{\theta }+\frac{1}{n}\sum _{i=1}^{n}g_{1}\left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right) +\frac{1}{n}\sum _{i=1}^{n}\chi \left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right) \tilde{\varvec{y}}_{i}^{\top }\varvec{\theta }^{i-1}, \end{aligned}$$
and \(\varvec{\chi }_{n}^{\top }=\left( \chi \left( \varvec{\theta }^{0};\varvec{z}_{1}\right) ,\dots ,\chi \left( \varvec{\theta }^{n-1};\varvec{z}_{n}\right) \right)\).
Again, given the quadratic form of \(f_{\text {L}}\left( \varvec{\theta }\right)\), we can obtain a minimizer by finding the root of the FOC \(\partial f_{\text {L}}\left( \varvec{\theta }\right) /\partial \varvec{\theta }=\mathbf {0}_{p}\). The FOC root is
$$\begin{aligned} \varvec{\theta }_{\text {L}}^{n}=\left( \tilde{\mathbf {Y}}_{n}^{\top }\tilde{\mathbf {Y}}_{n}+8\lambda n\tilde{\mathbf {I}}_{p}\right) ^{-1}\left( \sum _{i=1}^{n}\tilde{\varvec{y}}_{i}\tilde{\varvec{y}}_{i}^{\top }\varvec{\theta }^{i-1}+4\tilde{\mathbf {Y}}_{n}^{\top }\varvec{\chi }_{n}\right) , \end{aligned}$$
(17)
which yields the \(n\text {th}\) iteration update in Algorithm 1 for the SVM problem with logistic loss: set \(\varvec{\theta }^{n}\leftarrow \varvec{\theta }_{\text {L}}^{n}\). Again, observe that the algorithm is in an LS-like form and requires the computation of the value \(\chi \left( \varvec{\theta }^{n-1};\varvec{z}_{n}\right)\) only once per newly introduced datum \(\varvec{z}_{n}\) from the stream \(\left\{ \varvec{z}_{i}\right\}\).
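A corresponding sketch of the update (17) uses the same accumulators S and u as above, together with the vector of once-computed weights \(\chi \left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right)\).

```r
# Sketch of update (17), with cvec = sum_i chi(theta^{i-1}; z_i) y_tilde_i.
logistic_update <- function(S, u, cvec, n, lambda) {
  p1 <- nrow(S)
  I_tilde <- diag(c(0, rep(1, p1 - 1)))
  drop(solve(S + 8 * lambda * n * I_tilde, u + 4 * cvec))   # theta_L^n
}
```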

3.4 Computational remarks

Note that the expressions \(\tilde{\mathbf {Y}}_{n}^{\top }\varvec{\Omega }_{n}^{-1}\tilde{\mathbf {Y}}_{n}\) and \(\tilde{\mathbf {Y}}_{n}^{\top }\varvec{\Omega }_{n}^{-1}\left( \mathbf {1}_{n}+\varvec{\omega }_{n}\right)\) in (10) can be rewritten as
$$\begin{aligned} \sum _{i=1}^{n}\omega ^{-1}\left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right) \tilde{\varvec{y}}_{i}\tilde{\varvec{y}}_{i}^{\top }{\text { and }}\sum _{i=1}^{n}\left[ 1+\omega ^{-1}\left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right) \right] \tilde{\varvec{y}}_{i}, \end{aligned}$$
respectively. This implies that the computation of (10) requires only a single access of observation \(\varvec{z}_{n}\), at iteration n and at no other iteration, provided that one saves the previous iterates of the sums given above (i.e., \(\tilde{\mathbf {Y}}_{n-1}^{\top }\varvec{\Omega }_{n-1}^{-1}\tilde{\mathbf {Y}}_{n-1}\) and \(\tilde{\mathbf {Y}}_{n-1}^{\top }\varvec{\Omega }_{n-1}^{-1}\left( \mathbf {1}_{n-1}+\varvec{\omega }_{n-1}\right)\)).
Similar to the comment above, we can write \(\tilde{\mathbf {Y}}_{n}^{\top }\tilde{\mathbf {Y}}_{n}\) and \(\tilde{\mathbf {Y}}_{n}^{\top }\varvec{\psi }_{n}\) of (14) as
$$\begin{aligned} \sum _{i=1}^{n}\tilde{\varvec{y}}_{i}\tilde{\varvec{y}}_{i}^{\top }{\text { and }}\sum _{i=1}^{n}\psi \left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right) \tilde{\varvec{y}}_{i}, \end{aligned}$$
respectively, and we can write \(\tilde{\mathbf {Y}}_{n}^{\top }\varvec{\chi }_{n}\) as \(\sum _{i=1}^{n}\chi \left( \varvec{\theta }^{i-1};\varvec{z}_{i}\right) \tilde{\varvec{y}}_{i}\). Thus, the computations of both (14) and (17) require only a single access of the observation \(\varvec{z}_{n}\) at iteration n.

Therefore, in all three of the algorithms above, one is not required to store the entire stream \(\left\{ \varvec{z}_{i}\right\}\) to compute the \(n\text {th}\) iteration of the algorithm. This is a significant memory and computational advantage when comparing the SMM algorithms to their batch counterparts.
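Putting these observations together, the following R sketch runs the hinge loss SMM in a single pass over the stream, maintaining only the accumulated sums described above; the analogous single-pass loops for the squared-hinge and logistic losses differ only in the per-datum weight and in the linear system that is solved.

```r
# Single-pass sketch of the hinge-loss SMM: each z_i is accessed exactly once
# and only O((p+1)^2) numbers are retained between iterations.
smm_hinge_stream <- function(stream, p, lambda, eps = 1e-5) {
  p1    <- p + 1
  theta <- numeric(p1)                               # theta^0 = 0
  A     <- matrix(0, p1, p1)                         # sum_i y_tilde_i y_tilde_i' / omega_i
  b     <- numeric(p1)                               # sum_i (1 + omega_i) y_tilde_i / omega_i
  I_tilde <- diag(c(0, rep(1, p)))
  n <- 0
  for (z in stream) {
    n  <- n + 1
    yt <- z$y * c(1, z$x)                            # y_tilde_n
    om <- sqrt((1 - sum(yt * theta))^2 + eps)        # weight, computed once
    A  <- A + tcrossprod(yt) / om
    b  <- b + (1 + om) / om * yt
    theta <- solve(A + 4 * lambda * n * I_tilde, b)  # update (10)
  }
  theta
}
```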

4 Convergence analysis

4.1 General result

Repeating what has been stated in “Stochastic MM algorithm”, we formally state our assumptions on the functional components of \(f\left( \varvec{\gamma }\right)\) in Problem (5):
  1. (B1)

    The function \(f\left( \varvec{\gamma }\right)\) is real valued and takes inputs \(\varvec{\gamma }\in \Gamma\), where \(\Gamma\) is a compact and convex set.

     
  2. (B2)

    The function \(g_{1}\left( \varvec{\gamma };\varvec{w}\right)\) is twice continuously differentiable in \(\varvec{\gamma }\in \tilde{\Gamma }\), for each \(\varvec{w}\in \mathbb {W}\), where \(\tilde{\Gamma }\) is a bounded and open set such that \(\Gamma \subset \tilde{\Gamma }\).

     
  3. (B3)

    The function \(g_{2}\left( \varvec{\gamma };\varvec{w}\right)\) is convex and continuous in \(\varvec{\gamma }\in \Gamma\), for each \(\varvec{w}\in \mathbb {W}\).

     
The next set of assumptions pertain to the properties that are required of the majorizer \(\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right)\) of \(g_{1}\left( \varvec{\gamma };\varvec{w}\right)\), at \(\varvec{\delta }\) and \(\varvec{w}\), to obtain convergence results for Algorithm 1:
  1. (C1)

    The function \(\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right)\) is real valued, takes inputs \(\varvec{\gamma }\in \tilde{\Gamma }\), and majorizes \(g_{1}\left( \varvec{\gamma };\varvec{w}\right)\) at \(\varvec{\delta }\in \tilde{\Gamma }\) and \(\varvec{w}\in \mathbb {W}\) in the sense that (A1) and (A2) are satisfied.

     
  2. (C2)

    The function \(\tilde{g}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) =\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) +g_{2}\left( \varvec{\gamma };\varvec{w}\right)\) is uniformly strongly convex in \(\varvec{\gamma }\in \Gamma\), in the sense that for all valid \(\varvec{\gamma },\varvec{\delta }\in \Gamma\) and \(\varvec{w}\in \mathbb {W}\), there exists a constant \(\mu >0\), such that \(\tilde{g}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) -\frac{\mu }{2}\left( \varvec{\gamma }-\tilde{\varvec{\gamma }}\right) ^{\top }\left( \varvec{\gamma }-\tilde{\varvec{\gamma }}\right)\) is convex, for all \(\tilde{\varvec{\gamma }}\in \Gamma\) (cf. Mairal 2015).

     
Although we allow \(g_{1}\left( \varvec{\gamma };\varvec{w}\right)\) to potentially not be convex, and we allow \(g_{2}\left( \varvec{\gamma };\varvec{w}\right)\) to potentially not be smooth, we still require the following regularity conditions. Let \(\left\| \cdot \right\|\) denote an appropriate matrix or vector norm, and let the directional derivative of a function \(h\left( \varvec{\gamma }\right)\) that takes input \(\varvec{\gamma }\in \Gamma\), in the direction \(\varvec{v}\in \mathbb {R}^{d}\), be defined as
$$\begin{aligned} {\text {d}}_{\varvec{v}}h\left( \varvec{\gamma }\right) =\lim _{t\downarrow 0}\frac{h\left( \varvec{\gamma }+t\varvec{v}\right) -h\left( \varvec{\gamma }\right) }{t}, \end{aligned}$$
where \({\text {d}}_{\varvec{v}}h\left( \varvec{\gamma }\right) \equiv \infty\) by definition, if \(\varvec{\gamma }+t\varvec{v}\notin \Gamma\), for all \(t>0\).
  1. (D1)

    The function \(\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right)\) is continuous in \(\varvec{\gamma }\in \tilde{\Gamma }\), for fixed \(\varvec{\delta }\in \tilde{\Gamma }\) and \(\varvec{w}\in \mathbb {W}\).

     
  2. (D2)
    The functions \(g_{1}\left( \varvec{\gamma };\varvec{w}\right)\) and \(\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right)\) and their derivatives are uniformly bounded, in the sense that there exists a constant \(K_{1}>0\), such that \(\left| g_{1}\left( \varvec{\gamma };\varvec{w}\right) \right| \le K_{1}\), \(\left| \tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) \right| \le K_{1}\),
    $$\begin{aligned}&\left\| \frac{\partial g_{1}\left( \varvec{\gamma };\varvec{w}\right) }{\partial \varvec{\gamma }}\right\| \le K_{1},\; \left\| \frac{\partial \tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) }{\partial \varvec{\gamma }}\right\| \le K_{1}\text {, and, }\\&\left\| \frac{\partial ^{2}\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) }{\partial \varvec{\gamma }\partial \varvec{\gamma }^{\top }}\right\| \le K_{1}, \end{aligned}$$
    for every combination of valid \(\varvec{\gamma },\varvec{\delta }\in \tilde{\Gamma }\) and \(\varvec{w}\in \mathbb {W}\).
     
  3. (D3)

    The function \(g_{2}\left( \varvec{\gamma };\varvec{w}\right)\) and its directional derivative are uniformly bounded, in the sense that there exists a constant \(K_{2}>0\) such that \(\left| g_{2}\left( \varvec{\gamma };\varvec{w}\right) \right| \le K_{2}\) and \(\left| {\text {d}}_{\varvec{v}}g_{2}\left( \varvec{\gamma };\varvec{w}\right) \right| \le K_{2}\left\| \varvec{v}\right\|\), for all valid inputs \(\varvec{\gamma }\in \Gamma\) and \(\varvec{w}\in \mathbb {W}\), and valid directions \(\varvec{v}\in \mathbb {R}^{d}\) such that \(\varvec{\gamma }+\varvec{v}\in \Gamma\).

     
  4. (D4)

    There exists a constant \(G\ge 0\) such that \(\left| \tilde{g}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) \right| \le G\), for all valid \(\varvec{\gamma },\varvec{\delta }\in \Gamma\) and \(\varvec{w}\in \mathbb {W}\), where \(\tilde{g}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right)\) is as defined in (C2).

     
Define \(\varvec{\gamma }^{*}\) as a stationary point of Problem (5) if it satisfies the criterion \({\text {d}}_{\varvec{v}}f\left( \varvec{\gamma }^{*}\right) \ge 0\) for all \(\varvec{v}\in \mathbb {R}^{d}\) such that \(\varvec{\gamma }^{*}+\varvec{v}\in \Gamma\). That is, the directional derivative is non-negative for every feasible direction at the point \(\varvec{\gamma }^{*}\). Furthermore, define the distance from a point \(\varvec{\gamma }\in \Gamma\) to some set \(\mathbb {S}\subset \mathbb {R}^{d}\) as \({\text {dist}}\left( \varvec{\gamma },\mathbb {S}\right) \equiv \inf _{\varvec{s}\in \mathbb {S}}\left\| \varvec{\gamma }-\varvec{s}\right\| _{2}\), where \(\left\| \cdot \right\| _{2}\) is the Euclidean norm. We are now ready to state the general convergence result of Razaviyayn et al. (2016) for Algorithm 1.

Theorem 1

If assumptions (B1)–(B3), (C1), (C2), and (D1)–(D4) hold for functions \(g_{1}\left( \varvec{\gamma };\varvec{z}\right)\), \(g_{2}\left( \varvec{\gamma };\varvec{z}\right)\), and \(\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{z}\right)\), then the sequence of iterates \(\left\{ \varvec{\gamma }^{i}\right\}\) of the SMM algorithm (Algorithm 1) converges to the set of stationary points of Problem (5) in the sense that
$$\begin{aligned} \lim _{n\rightarrow \infty }{\text {dist}}\left( \varvec{\gamma }^{n},\Gamma ^{*}\right) =0, \end{aligned}$$
with probability one, where \(\Gamma ^{*}\) is the set of stationary points of Problem (5).

Theorem 1 provides a powerful convergence result that allows practitioners to be confident that the algorithm will produce meaningful results, provided enough data is made available through the input stream.

4.2 Application to the SMM algorithms for SVMs

We wish to apply Theorem 1 to conclude convergence results for our SMM algorithms for solving the approximate hinge loss, approximate squared-hinge loss, and the logistic loss SVMs that were derived in “SMM algorithms for SVMs”. Recall that \(\varvec{\gamma }\), \(\varvec{w}\), \(\Gamma\), and \(\mathbb {W}\) are interchangeable with \(\varvec{\theta }\), \(\varvec{z}\), \(\Theta\), and \(\mathbb {X}\times \left\{ -1,1\right\}\).

First, assumption (B1) can be satisfied in all cases by setting \(\Theta \subset \mathbb {R}^{p+1}\) to be some hypercube \(\Theta =\left[ -a,a\right] ^{p+1}\), for some sufficiently large \(a>0\). We call this Assumption (B1a). Our approximate loss functions for the hinge and squared-hinge loss SVMs were constructed to satisfy (B2) and (B3), whereas the logistic loss function satisfies (B2) and (B3) naturally. Furthermore, there are no issues with the convexity and continuity of \(g_{2}\left( \varvec{\theta };\varvec{z}\right) =\lambda \varvec{\beta }^{\top }\varvec{\beta }\), as it is simply a quadratic regularizer.

Assumption (C1) is satisfied for all three loss functions, as the surrogates \(\tilde{g}_{1}\left( \varvec{\theta },\varvec{\delta };\varvec{z}\right)\) are constructed to satisfy (A1) and (A2) in each case. Here, we can take \(\bar{\Theta }=\mathbb {R}^{p+1}\). Assumption (C2) is more difficult to assess, but can be validated in all cases under mild additional assumptions. Since all of the functions involved in the SMM algorithm are twice continuously differentiable, we can use the characterization that a function \(h\left( \varvec{\gamma }\right)\) taking inputs \(\varvec{\gamma }\in \Gamma\) is strongly convex if \(\partial ^{2}h\left( \varvec{\gamma }\right) /\partial \varvec{\gamma }\partial \varvec{\gamma }^{\top }-\mu \mathbf {I}_{p}\) is positive definite for some \(\mu >0\) (cf. Boyd and Vandenberghe 2004, Sect. 9.1.2). In other words, the smallest eigenvalue of the Hessian of \(h\left( \varvec{\gamma }\right)\) is lower bounded by the constant \(\mu\), for all \(\varvec{\gamma }\in \Gamma\). Given the previous definition, Assumption (C2) as applied to the approximate hinge loss, approximate squared-hinge loss, and logistic loss functions can be stated as: (C2H) the smallest eigenvalue of \(\frac{1}{2n}\tilde{\mathbf {Y}}_{n}^{\top }\varvec{\Omega }_{n}^{-1}\tilde{\mathbf {Y}}_{n}+2\lambda \tilde{\mathbf {I}}_{p}\) is lower bounded by some \(\mu >0\), (C2S) the smallest eigenvalue of \(\frac{2}{n}\tilde{\mathbf {Y}}_{n}^{\top }\tilde{\mathbf {Y}}_{n}+2\lambda \tilde{\mathbf {I}}_{p}\) is lower bounded by some \(\mu >0\), and (C2L) the smallest eigenvalue of \(\frac{1}{4n}\tilde{\mathbf {Y}}_{n}^{\top }\tilde{\mathbf {Y}}_{n}+2\lambda \tilde{\mathbf {I}}_{p}\) is lower bounded by some \(\mu >0\), respectively. Note that although we must make the explicit assumptions (C2H), (C2S), or (C2L), for the respective loss functions, we can be very confident that they are satisfied, since if \(\tilde{\mathbf {I}}_{p}\) were to be replaced by \(\mathbf {I}_{p+1}\) in each of the Hessian expressions, we would have strong convexity in every case by setting \(\mu =2\lambda\).
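As a practical illustration, (C2H) can be monitored empirically along the stream via the smallest eigenvalue of the surrogate Hessian; the sketch below assumes that the accumulator A holds \(\tilde{\mathbf {Y}}_{n}^{\top }\varvec{\Omega }_{n}^{-1}\tilde{\mathbf {Y}}_{n}\), as in the single-pass implementation sketched earlier.

```r
# Sketch of an empirical check of (C2H): returns the smallest eigenvalue of
# (1/(2n)) Y' Omega^{-1} Y + 2*lambda*I_tilde, which should remain above some
# mu > 0 for the strong-convexity condition to hold.
check_C2H <- function(A, n, lambda) {
  p1 <- nrow(A)
  I_tilde <- diag(c(0, rep(1, p1 - 1)))
  min(eigen(A / (2 * n) + 2 * lambda * I_tilde, symmetric = TRUE)$values)
}
```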

Next, we can check that (D1) is fulfilled for all three loss functions by construction. Furthermore, since each of the surrogate functions is quadratic and thus smooth, (D2) is fulfilled if (B1a) is assumed. Similarly, since the penalty \(g_{2}\left( \varvec{\theta };\varvec{z}\right)\) is a quadratic and thus smooth, (D3) is also fulfilled if (B1a) is assumed. Finally, since \(\tilde{g}_{1}\left( \varvec{\theta },\varvec{\delta };\varvec{z}\right)\) is continuous in every case and fulfills (D1), (D4) is automatically satisfied if (B1a) is assumed. We can, therefore, state the following convergence result regarding the three SMM algorithms.

Proposition 1

Under Assumption (B1a), together with (C2H), (C2S), or (C2L), respectively, the SMM algorithms for the SVM problems with approximate hinge loss, approximate squared-hinge loss, and logistic loss, as defined by the use of the \(n\text {th}\) step iterates \(\varvec{\theta }^{n}\leftarrow \varvec{\theta }_{\text {H}}^{n}\), \(\varvec{\theta }^{n}\leftarrow \varvec{\theta }_{\text {S}}^{n}\), and \(\varvec{\theta }^{n}\leftarrow \varvec{\theta }_{\text {L}}^{n}\) in Algorithm 1, permit the conclusion of Theorem 1 with \(\left\{ \varvec{\gamma }^{i}\right\}\) replaced by the respective sequence of SMM iterates \(\left\{ \varvec{\theta }^{i}\right\}\) and \(\Gamma ^{*}\) replaced by the set \(\Theta ^{*}\) of stationary points of the respective SVM problem.

Let \(\tilde{\Theta }\) be a convex and open subset of \(\Theta\) under Assumption (B1a). Via Chebyshev’s inequality, and under the assumptions of Proposition 1, \(\tilde{f}_{n}\left( \varvec{\theta }\right)\) approaches \(f\left( \varvec{\theta }\right)\) in probability as \(n\rightarrow \infty\), for each \(\varvec{\theta }\in \tilde{\Theta }\) and for any choice of loss function that we have considered, when the regularization is \(g_{2}\left( \varvec{\theta };\varvec{z}\right) =\lambda \varvec{\beta }^{\top }\varvec{\beta }\). Furthermore, for any choice of loss function along with the quadratic regularization term, we can observe that \(\tilde{f}_{n}\left( \varvec{\theta }\right)\) is convex, since it is a sum of convex functions. Application of the convexity lemma from Pollard (1991) then yields the following conclusion.

Corollary 1

Let \(\tilde{\Theta }\) be the interior of \(\Theta\) under (B1a). Under the assumptions of Proposition 1, \(f\left( \varvec{\theta }\right)\) is convex on \(\tilde{\Theta }\) for any loss function \(g_{1}\left( \varvec{\theta };\varvec{z}\right)\) that was considered in Proposition 1 and quadratic regularization \(g_{2}\left( \varvec{\theta };\varvec{z}\right) =\lambda \varvec{\beta }^{\top }\varvec{\beta }\), where \(\lambda >0\). Furthermore, since \(f\left( \varvec{\theta }\right)\) is convex, the set of stationary points \(\Theta ^{*}\) for each problem is therefore the set of global minimizers of the respective problem on the set \(\tilde{\Theta }\).

5 Numerical simulations

The SMM algorithms that were described in “SMM algorithms for SVMs” are implemented in the R programming environment (R Core Team, 2016), with particularly computationally intensive loops programmed in C and integrated via the Rcpp package (Eddelbuettel, 2013). The implementations can be freely obtained at github.com/andrewthomasjones. All computations were conducted on a MacBook Pro with a 2.2 GHz Intel Core i7 processor, 16 GB of 1600 MHz DDR3 memory, and a 500 GB SSD. Computational times that are reported are obtained via the proc.time() function. Through prior experimentation, we have found that setting \(\epsilon =10^{-5}\) and \(\lambda =1/N\) yields good results in practice, where N is defined in the sequel. As such, for all of our numerical computations, these are the settings that we utilize.

5.1 Simulation 1

We sample streams of N observations \(\left\{ \varvec{z}_{i}\right\}\), where each \(\varvec{z}_{i}\) (\(i\in \left[ N\right]\)) is a realization of the random variable \(\varvec{Z}\sim \mathcal {L}_{\varvec{Z}}\). The law \(\mathcal {L}_{\varvec{Z}}\) is defined in the following hierarchical manner. First, \(Y\in \left\{ -1,1\right\}\) is generated with equal probability (i.e. \(\mathbb {P}\left( Y=-1\right) =\mathbb {P}\left( Y=1\right) =1/2\)). Next, conditional on \(Y=y\), \(\varvec{X}\) is generated from a \(p{\text {-dimensional}}\) multivariate Gaussian distribution with mean \(\Delta y\) and identity covariance matrix. The three factors that can be varied for the simulation are N, p, and \(\Delta\). Here, we choose to simulate the scenarios \(N\in \left\{ 1\times 10^{4},5\times 10^{4},1\times 10^{5}\right\}\), \(p\in \left\{ 5,10,20\right\}\), and \(\Delta \in \left\{ 0.125,0.25,0.5\right\}\).
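An R sketch of this data-generating process follows; the reading of “mean \(\Delta y\)” as a p-vector with every entry equal to \(\Delta y\) is an assumption made here for illustration.

```r
# Sketch of the Simulation 1 generator: Y is a fair +/-1 label and, given Y = y,
# X is p-dimensional Gaussian with every mean entry equal to Delta * y and
# identity covariance (assumed interpretation of the stated design).
simulate_stream <- function(N, p, Delta) {
  lapply(seq_len(N), function(i) {
    y <- sample(c(-1, 1), 1)                  # P(Y = -1) = P(Y = 1) = 1/2
    x <- rnorm(p, mean = Delta * y, sd = 1)   # independent N(Delta * y, 1) coordinates
    list(x = x, y = y)
  })
}
```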

5.2 Comparisons

Each of our three SMM SVM algorithms is assessed based on two factors. First, they are assessed based on the computational time required to compute \(\hat{\varvec{\theta }}\), where \(\hat{\varvec{\theta }}\) is defined as the estimator of \(\varvec{\theta }\) for each of the SVM problems computed over a single sweep of the stream \(\left\{ \varvec{z}_{i}\right\}\). That is, each of the N observations from the stream \(\left\{ \varvec{z}_{i}\right\}\) is only accessed once.

Secondly, using a test set, we measure the accuracy of the classification rule \(\hat{y}={\text {sign}}\left( \tilde{\varvec{x}}^{\top }\hat{\varvec{\theta }}\right)\) for each of the SMM-fitted approximate hinge loss, approximate squared-hinge loss, and logistic loss SVMs. We shall refer to these three SVMs as SMMH, SMMS, and SMML, from here on. For each simulation scenario, the computational time and accuracy are measured over 10 repetitions each and then averaged to yield appropriately precise measures of comparison.
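A simple R sketch of the accuracy measure used in the comparisons is given below; the test set is assumed to be held in the same list format as the simulated streams.

```r
# Sketch of the test-set accuracy of the classifier y_hat = sign(x_tilde' theta_hat).
accuracy <- function(test, theta_hat) {
  mean(vapply(test, function(z) {
    pred <- if (sum(c(1, z$x) * theta_hat) < 0) -1 else 1
    pred == z$y
  }, logical(1)))
}
```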

Along with the three SVM algorithms that are described in “SMM algorithms for SVMs”, we also assess the performances of the hinge loss, squared-hinge loss, and logistic loss SVMs (with \(P\left( \varvec{\beta }\right) =\lambda \varvec{\beta }^{\top }\varvec{\beta }\)) as fitted by the LIBLINEAR package of solvers of Fan et al. (2008), applied via the LiblineaR package of Helleputte (2017). The hinge loss SVM can be fitted via a dual optimization routine (LIBHD), and the squared-hinge and logistic loss SVMs can be fitted via both dual and primal optimization routines (LIBSD and LIBSP, and LIBLD and LIBLP). We note that the five optimization routines for fitting the three different SVM varieties from the LIBLINEAR package are batch methods and require the entire data set to be maintained in storage contemporaneously. These algorithms are therefore not directly comparable to the SMM algorithms. We have included the LIBLINEAR algorithms’ results to provide a “gold-standard” benchmark against which we can compare our algorithms.

For a stream-suitable algorithm benchmark, we also compare our methods to the PEGASOS algorithm of Shalev-Shwartz et al. (2011), as applied in R via a modified implementation of the codes obtained from github.com/wrathematics. The PEGASOS algorithm solves the hinge loss SVM problem via stochastic sub-gradient descent and is run in its streamed form, where each \(\varvec{z}_{i}\) of \(\left\{ \varvec{z}_{i}\right\}\) is utilized once and in the order of its arrival. The algorithm is terminated upon having used up all N observations of the stream, so as to make it comparable to SMMH, SMMS, and SMML. Each of the LIBLINEAR algorithms and PEGASOS is also compared to the three methods from “SMM algorithms for SVMs” based on the average computational time required to compute \(\hat{\varvec{\theta }}\) and the accuracy of the constructed classifier.

5.3 Results of Simulation 1

We present the timing results for Simulation 1 in Table 1. The corresponding accuracy results are presented in Table 2.
Table 1
Average computation times (in seconds) for Simulation 1 from 10 repetitions

\(\Delta\)   N          p    LIBHD   LIBSD    LIBLD   SMMH    SMMS    SMML    LIBSP   LIBLP   PEGASOS
0.125        1.00E+04   5    0.88    5.73     0.71    0.16    0.17    0.17    0.00    0.00    0.00
0.125        1.00E+04   10   1.24    8.21     1.04    0.50    0.53    0.53    0.00    0.00    0.00
0.125        1.00E+04   20   1.92    13.47    1.55    1.92    2.08    2.07    0.01    0.01    0.00
0.125        5.00E+04   5    5.42    33.39    4.23    0.82    0.84    0.85    0.01    0.02    0.00
0.125        5.00E+04   10   8.24    48.45    5.98    2.63    2.71    2.70    0.03    0.03    0.01
0.125        5.00E+04   20   12.93   80.00    9.19    10.46   10.58   10.57   0.06    0.07    0.02
0.125        1.00E+05   5    13.39   74.88    10.16   1.71    1.72    1.71    0.04    0.05    0.01
0.125        1.00E+05   10   19.04   102.25   13.02   5.35    5.42    5.40    0.06    0.07    0.02
0.125        1.00E+05   20   28.00   170.27   19.40   21.06   20.97   20.96   0.11    0.14    0.04
0.25         1.00E+04   5    1.03    5.96     0.79    0.17    0.17    0.17    0.00    0.01    0.00
0.25         1.00E+04   10   1.31    7.37     1.04    0.53    0.55    0.55    0.00    0.01    0.00
0.25         1.00E+04   20   1.54    8.24     1.56    1.98    2.10    2.10    0.01    0.01    0.00
0.25         5.00E+04   5    6.15    39.91    5.00    0.86    0.85    0.85    0.02    0.03    0.01
0.25         5.00E+04   10   7.53    42.07    6.15    2.69    2.70    2.69    0.03    0.04    0.01
0.25         5.00E+04   20   8.75    45.42    8.46    10.33   10.61   10.61   0.06    0.08    0.02
0.25         1.00E+05   5    11.83   63.72    8.30    1.72    1.68    1.67    0.03    0.05    0.01
0.25         1.00E+05   10   14.89   80.21    11.48   5.32    5.15    5.16    0.06    0.07    0.02
0.25         1.00E+05   20   16.62   83.35    16.27   20.28   20.45   20.43   0.10    0.13    0.03
0.5          1.00E+04   5    0.74    3.07     0.64    0.17    0.16    0.16    0.00    0.00    0.00
0.5          1.00E+04   10   0.56    2.54     0.83    0.49    0.51    0.51    0.00    0.01    0.00
0.5          1.00E+04   20   0.47    0.96     1.10    1.87    2.06    2.05    0.01    0.01    0.00
0.5          5.00E+04   5    3.53    17.95    3.62    0.85    0.82    0.81    0.02    0.03    0.00
0.5          5.00E+04   10   2.96    14.49    4.96    2.59    2.56    2.56    0.03    0.05    0.01
0.5          5.00E+04   20   1.89    6.24     6.47    9.71    10.26   10.25   0.05    0.08    0.01
0.5          1.00E+05   5    7.63    40.03    8.01    1.71    1.63    1.62    0.04    0.05    0.01
0.5          1.00E+05   10   6.44    29.33    10.29   5.12    5.19    5.18    0.06    0.09    0.01
0.5          1.00E+05   20   4.22    13.80    13.82   19.59   20.52   20.53   0.11    0.16    0.03

Table 2
Average training accuracies for Simulation 1 from 10 repetitions

\(\Delta\)   N          p    LIBHD   LIBSD   LIBLD   SMMH    SMMS    SMML    LIBSP   LIBLP   PEGASOS
0.125        1.00E+04   5    0.61    0.61    0.61    0.59    0.61    0.61    0.61    0.61    0.56
0.125        1.00E+04   10   0.65    0.65    0.65    0.65    0.65    0.65    0.65    0.65    0.59
0.125        1.00E+04   20   0.71    0.71    0.71    0.70    0.71    0.71    0.71    0.71    0.63
0.125        5.00E+04   5    0.61    0.61    0.61    0.61    0.61    0.61    0.61    0.61    0.54
0.125        5.00E+04   10   0.65    0.65    0.65    0.65    0.65    0.65    0.65    0.65    0.56
0.125        5.00E+04   20   0.71    0.71    0.71    0.71    0.71    0.71    0.71    0.71    0.62
0.125        1.00E+05   5    0.61    0.61    0.61    0.61    0.61    0.61    0.61    0.61    0.54
0.125        1.00E+05   10   0.65    0.65    0.65    0.65    0.65    0.65    0.65    0.65    0.57
0.125        1.00E+05   20   0.71    0.71    0.71    0.71    0.71    0.71    0.71    0.71    0.63
0.25         1.00E+04   5    0.71    0.71    0.71    0.71    0.71    0.71    0.71    0.71    0.63
0.25         1.00E+04   10   0.78    0.78    0.78    0.78    0.78    0.78    0.78    0.78    0.70
0.25         1.00E+04   20   0.87    0.87    0.87    0.86    0.87    0.87    0.87    0.87    0.81
0.25         5.00E+04   5    0.71    0.71    0.71    0.71    0.71    0.71    0.71    0.71    0.63
0.25         5.00E+04   10   0.78    0.78    0.78    0.78    0.78    0.78    0.78    0.78    0.73
0.25         5.00E+04   20   0.87    0.87    0.87    0.87    0.87    0.87    0.87    0.87    0.82
0.25         1.00E+05   5    0.71    0.71    0.71    0.71    0.71    0.71    0.71    0.71    0.65
0.25         1.00E+05   10   0.79    0.79    0.79    0.78    0.79    0.79    0.79    0.79    0.71
0.25         1.00E+05   20   0.87    0.87    0.87    0.86    0.87    0.87    0.87    0.87    0.82
0.5          1.00E+04   5    0.87    0.87    0.87    0.87    0.87    0.87    0.87    0.87    0.79
0.5          1.00E+04   10   0.94    0.94    0.94    0.94    0.94    0.94    0.94    0.94    0.91
0.5          1.00E+04   20   0.99    0.99    0.99    0.98    0.99    0.99    0.99    0.99    0.98
0.5          5.00E+04   5    0.87    0.87    0.87    0.86    0.87    0.87    0.87    0.87    0.82
0.5          5.00E+04   10   0.94    0.94    0.94    0.94    0.94    0.94    0.94    0.94    0.90
0.5          5.00E+04   20   0.99    0.99    0.99    0.98    0.99    0.99    0.99    0.99    0.98
0.5          1.00E+05   5    0.87    0.87    0.87    0.87    0.87    0.87    0.87    0.87    0.81
0.5          1.00E+05   10   0.94    0.94    0.94    0.94    0.94    0.94    0.94    0.94    0.92
0.5          1.00E+05   20   0.99    0.99    0.99    0.98    0.99    0.99    0.99    0.99    0.98

From Table 1, we observe that the computational time of the SMM algorithms tends to increase with both N and p. Increases in N tend to yield a linear increase in computational time, whereas increases in p yield nonlinear increases. Given each of the expressions (10), (14), and (17), we can write the time complexity of the SMM algorithms as \(O\left( Np^{3}\right)\), which is consistent with our observations. We notice that the SMM algorithms are essentially unaffected by the class separation \(\Delta\). All three of the SMM algorithms require approximately the same amount of computational time in all cases.
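The source of this scaling can be illustrated structurally: each streamed observation triggers an update of \((p+1)\)-dimensional running quantities followed by the solution of a \((p+1)\)-dimensional linear system, giving \(O(p^{3})\) work per observation and hence \(O(Np^{3})\) over one sweep. The R sketch below is schematic only; the running quantities A and b are illustrative placeholders and do not reproduce expressions (10), (14), or (17).

    # Schematic only: the per-observation cost pattern behind the O(N p^3) complexity.
    # A and b are placeholder running quantities, not the surrogates of the paper.
    p <- 10
    A <- diag(p + 1)                  # running (p + 1) x (p + 1) quadratic term (placeholder)
    b <- rep(0, p + 1)                # running linear term (placeholder)
    for (i in seq_len(1e4)) {         # a stream of N = 1e4 observations
      x_tilde <- c(1, rnorm(p))       # stand-in for the i-th streamed feature vector
      A <- A + tcrossprod(x_tilde)    # O(p^2) accumulation
      b <- b + x_tilde                # O(p) accumulation
      theta <- solve(A, b)            # O(p^3) linear solve at every observation
    }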

In contrast, we observe that the dual algorithms LIBHD, LIBSD, and LIBLD have computational times that decrease as \(\Delta\) increases. These algorithms also exhibit increasing computational time as N and p increase, although there are some nonlinearities, such as computational times decreasing as p increases for fixed N and \(\Delta\); this occurs in the cases where \(\Delta =0.5\). Other than these effects, we notice that the SMM algorithms are comparable to the dual algorithms when p is 5 or 10, and are slower when \(p=20\) for \(\Delta =0.5\). For \(\Delta =0.25\), the same statement holds except when compared to LIBSD, which is much slower than all other algorithms in all cases of p and N. For \(\Delta =0.125\), we observe that the SMM algorithms are faster than their dual counterparts except in the case of \(N=10{,}000\) and \(p=20\), where the LIBHD and LIBLD algorithms are faster by a small amount. The LIBSD algorithm is also substantially slower than the other algorithms in all cases here.

The three algorithms LIBSP, LIBLP, and PEGASOS are all multiple orders of magnitude faster than the SMM algorithms in all cases. This is likely due to their conjugate gradient and coordinate-descent forms and their optimized implementations, in contrast to the Newton-like iterations and ad hoc implementations of our SMM algorithms.

From Table 2, we observe that the accuracy of the SMM algorithms is in most cases equal to the accuracy of the five batch algorithms. This is very surprising, as the SMM algorithms are only permitted to inspect each observation from the stream \(\left\{ \varvec{z}_{i}\right\}\) once, whereas the batch algorithms are permitted as many inspections of the observations as are required for convergence. Only the SMMH algorithm is less accurate than the batch algorithms in some cases. In situations where it is less accurate, the deficit is only 0.01 or 0.02, which we find tolerable given the streaming nature of the algorithm. The PEGASOS algorithm is substantially less accurate than all of the other algorithms when implemented in its streaming configuration. Given this fact, one is faced with a tradeoff between speed and accuracy when comparing PEGASOS to the SMM algorithms.

5.4 Simulation 2

In this second set of simulations, we consider some larger data sets. The simulations are conducted as described in “Simulation 1”. However, we now choose to simulate the scenarios \(N\in \left\{ 5\times 10^{5},1\times 10^{6},5\times 10^{6}\right\}\), \(p\in \left\{ 10,20,50\right\}\), and \(\Delta \in \left\{ 0.125,0.25,0.5\right\}\).

5.5 Results of Simulation 2

We present the timing results for Simulation 2 in Table 3. The corresponding accuracy results are presented in Table 4. We did not compare the SMM algorithms with the dual algorithms LIBHD, LIBSD, and LIBLD, due to their similarity in speed in the \(\Delta =0.25\) and 0.5 cases, as well as their larger computational cost when \(\Delta =0.125\).
Table 3
Average computation times (in seconds) for Simulation 2 from 10 repetitions

\(\Delta\)   N          p    SMMH     SMMS     SMML     LIBSP   LIBLP   PEGASOS
0.125        5.00E+05   10   1.42     1.91     1.93     0.33    0.33    0.05
0.125        5.00E+05   20   4.43     6.49     6.49     0.64    0.76    0.10
0.125        5.00E+05   50   30.77    49.03    49.02    1.45    1.73    0.17
0.125        1.00E+06   10   2.59     3.51     3.51     0.66    0.77    0.09
0.125        1.00E+06   20   8.76     12.57    12.57    1.58    2.00    0.16
0.125        1.00E+06   50   64.78    103.06   102.54   3.67    4.44    0.40
0.125        5.00E+06   10   12.81    17.66    17.66    3.41    3.85    0.46
0.125        5.00E+06   20   43.47    61.97    62.28    8.66    9.36    0.80
0.125        5.00E+06   50   320.11   510.16   512.75   27.75   27.48   2.02
0.25         5.00E+05   10   1.39     1.90     1.88     0.52    0.59    0.05
0.25         5.00E+05   20   4.36     6.22     6.26     0.87    1.10    0.07
0.25         5.00E+05   50   32.16    51.67    51.69    2.13    2.66    0.19
0.25         1.00E+06   10   2.77     3.79     3.79     1.15    1.25    0.10
0.25         1.00E+06   20   8.70     12.46    12.47    1.80    2.18    0.15
0.25         1.00E+06   50   61.34    97.55    96.89    3.17    3.98    0.35
0.25         5.00E+06   10   12.68    17.46    17.38    4.51    5.15    0.44
0.25         5.00E+06   20   39.98    57.60    57.98    6.71    9.10    0.73
0.25         5.00E+06   50   299.10   477.24   477.83   24.80   23.16   1.74
0.5          5.00E+05   10   1.28     1.77     1.74     0.33    0.48    0.04
0.5          5.00E+05   20   4.02     5.75     5.79     0.64    1.05    0.07
0.5          5.00E+05   50   29.91    47.71    47.69    1.61    2.05    0.16
0.5          1.00E+06   10   2.55     3.47     3.46     0.83    1.17    0.08
0.5          1.00E+06   20   8.01     11.53    11.57    1.39    2.05    0.14
0.5          1.00E+06   50   59.90    95.64    95.52    2.72    4.16    0.34
0.5          5.00E+06   10   12.62    17.33    17.34    4.22    5.80    0.42
0.5          5.00E+06   20   43.07    62.35    61.80    7.90    10.96   0.70
0.5          5.00E+06   50   319.80   504.86   503.23   24.23   31.03   1.88

Table 4
Average training accuracies for Simulation 2 from 10 repetitions

\(\Delta\)   N          p    SMMH   SMMS   SMML   LIBSP   LIBLP   PEGASOS
0.125        5.00E+05   10   0.65   0.65   0.65   0.65    0.65    0.56
0.125        5.00E+05   20   0.71   0.71   0.71   0.71    0.71    0.62
0.125        5.00E+05   50   0.81   0.81   0.81   0.81    0.81    0.74
0.125        1.00E+06   10   0.65   0.65   0.65   0.65    0.65    0.57
0.125        1.00E+06   20   0.71   0.71   0.71   0.71    0.71    0.62
0.125        1.00E+06   50   0.81   0.81   0.81   0.81    0.81    0.74
0.125        5.00E+06   10   0.65   0.65   0.65   0.65    0.65    0.56
0.125        5.00E+06   20   0.71   0.71   0.71   0.71    0.71    0.64
0.125        5.00E+06   50   0.81   0.81   0.81   0.81    0.81    0.74
0.25         5.00E+05   10   0.78   0.79   0.79   0.79    0.79    0.71
0.25         5.00E+05   20   0.87   0.87   0.87   0.87    0.87    0.81
0.25         5.00E+05   50   0.96   0.96   0.96   0.96    0.96    0.94
0.25         1.00E+06   10   0.78   0.79   0.79   0.79    0.79    0.72
0.25         1.00E+06   20   0.87   0.87   0.87   0.87    0.87    0.81
0.25         1.00E+06   50   0.96   0.96   0.96   0.96    0.96    0.94
0.25         5.00E+06   10   0.78   0.79   0.79   0.79    0.79    0.70
0.25         5.00E+06   20   0.87   0.87   0.87   0.87    0.87    0.82
0.25         5.00E+06   50   0.96   0.96   0.96   0.96    0.96    0.94
0.5          5.00E+05   10   0.94   0.94   0.94   0.94    0.94    0.92
0.5          5.00E+05   20   0.99   0.99   0.99   0.99    0.99    0.98
0.5          5.00E+05   50   1.00   1.00   1.00   1.00    1.00    1.00
0.5          1.00E+06   10   0.94   0.94   0.94   0.94    0.94    0.90
0.5          1.00E+06   20   0.99   0.99   0.99   0.99    0.99    0.98
0.5          1.00E+06   50   1.00   1.00   1.00   1.00    1.00    1.00
0.5          5.00E+06   10   0.94   0.94   0.94   0.94    0.94    0.91
0.5          5.00E+06   20   0.99   0.99   0.99   0.99    0.99    0.98
0.5          5.00E+06   50   1.00   1.00   1.00   1.00    1.00    1.00

From Table 3, we notice the same relationships between computational time and N, p, and \(\Delta\) as in Simulation 1. We also notice that the SMMH algorithm is faster than the SMMS and SMML algorithms in all cases. Upon inspection, it appears that the SMM algorithms are approximately one to two orders of magnitude slower than the batch algorithms LIBSP and LIBLP in each scenario. The PEGASOS algorithm is then another order of magnitude faster than the batch algorithms.

From Table 4, we observe again, as with Simulation 1, that the SMM algorithms are nearly always equal in accuracy to the batch algorithms. The only exception is that the SMMH algorithm has an accuracy deficit of 0.01 on some occasions. The PEGASOS algorithm is once again significantly less accurate than the other algorithms. Thus, we are once more faced with a choice between speed and accuracy when choosing between the PEGASOS and SMM algorithms.

We finally remark that making comparisons between the computational times of the SMM algorithms and the batch algorithms is only for the sake of benchmarking, and such comparisons may be misleading from a practical point of view. This is because the two classes of algorithms are constructed to perform two seemingly similar but fundamentally different learning tasks. The computational time per iteration of the SMM algorithms would yield a more practically meaningful index of performance, as it is more congruous with the intended application setting of these algorithms.

Remark 1

From the joint results of the two simulations (Simulations 1 and 2), we observe that SMMH is generally markedly faster than SMMS and SMML. However, the additional speed of the SMMH algorithm appears to necessitate a small tradeoff in accuracy, as we observe that SMMS and SMML yield slightly higher accuracy than SMMH in a small number of scenarios. Thus, in applications, we recommend the use of SMMH when speed is the only imperative, whereas SMMS and SMML may be more appropriate when a balance between speed and accuracy is required.

Remark 2

We can make the following recommendations from the above comparisons between the algorithms from the LIBLINEAR package, the SMM algorithms, and PEGASOS. First, it is best not to use LIBHD, LIBSD, and LIBLD in any circumstance, as they require excessive computational time without providing better accuracy in return, when compared to the other methods. Second, for batch data, the LIBSP and LIBLP algorithms are preferred to the SMM algorithms, because they have been highly optimized for operation in such circumstances and are both fast and accurate when applied to large batch samples. Next, the SMM algorithms are able to achieve nearly identical accuracy levels to LIBSP and LIBLP while operating on streamed data; they are, however, slower when the data are acquired in batch. Finally, if computational time is the only imperative, then PEGASOS should be preferred to all of the other algorithms, whether the data are acquired in batch or streamed. However, PEGASOS is significantly less accurate than the other methods, especially when the class separation and sample size are small.

6 MNIST data analysis

The MNIST data set consists of two components. The first component is an \(M=60{,}000\) training sample of observations \(\left\{ \varvec{\zeta }_{i}\right\} _{\text {Train}}\) representing images of handwritten digits between 0 and 9. The second component is an \(N=10{,}000\) testing sample of observations \(\left\{ \varvec{\zeta }_{i}\right\} _{\text {Test}}\). Each datum of \(\left\{ \varvec{\zeta }_{i}\right\} _{\text {Train}}\) or \(\left\{ \varvec{\zeta }_{i}\right\} _{\text {Test}}\) is of the form \(\varvec{\zeta }_{i}^{\top }=\left( y_{i},\varvec{\xi }_{i}^{\top }\right)\) (\(i\in \left[ M\right]\) or \(i\in \left[ N\right]\), respectively), where \(y_{i}\in \left\{ -1,1\right\}\) is a label indicating whether the image is of a zero (\(y_{i}=-1\)) or of any other digit (\(y_{i}=1\)). The vector \(\varvec{\xi }_{i}\in \left\{ 0,\dots ,255\right\} ^{784}\) expresses the greyscale intensity of the image at each of its \(q=28\times 28=784\) pixels. Figure 1 displays the first 100 images from \(\left\{ \varvec{\zeta }_{i}\right\} _{\text {Train}}\). We note that there are 5923 zeros in the training set and 980 zeros in the testing set.
Fig. 1 First 100 images from the MNIST training set

We seek to use the training set \(\left\{ \varvec{\zeta }_{i}\right\} _{\text {Train}}\) to construct an SVM classifier that can accurately distinguish zero from nonzero images in the test set \(\left\{ \varvec{\zeta }_{i}\right\} _{\text {Test}}\). We compare the performance of the three SMM algorithms with that of the LIBSP, LIBLP, and PEGASOS algorithms on this task. The performance indicators are the average test set accuracy and the average computational time in seconds, each computed from 10 repetitions of the algorithms over random orderings of the training set.

6.1 Preprocessing

Before progressing with our analyses, we first reduce the dimensionality of our training and testing data sets. Using the training data \(\left\{ \varvec{\zeta }_{i}\right\} _{\text {Train}}\), we perform a principal component analysis (PCA; see for example Jolliffe 2002) decomposition of the raw intensities \(\varvec{\xi }_{i}\) to yield the principal component (PC) features \(\varvec{x}_{i}\in \mathbb {R}^{p}\), where \(\varvec{x}_{i}\) contains the first p PCs. We select \(p\in \left\{ 10,20,50\right\}\) for our experiments to construct the preprocessed training and test sets \(\left\{ \varvec{z}_{i}\right\} _{\text {Train}}\) and \(\left\{ \varvec{z}_{i}\right\} _{\text {Test}}\), where \(\varvec{z}_{i}^{\top }=\left( y_{i},\varvec{x}_{i}^{\top }\right)\). All constructions of classifiers and reporting of classifier performances are based on the use of these preprocessed data.
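A minimal R sketch of this preprocessing step follows; the object names (xi_train, xi_test, digit_train, digit_test) are placeholders for the raw intensity matrices and digit labels, and we assume the usual approach of fitting the rotation on the training intensities and applying it to the test intensities.

    # Minimal sketch (placeholder object names): PCA fitted on the training intensities,
    # then projection of both training and test images onto the first p PCs.
    p <- 20                                         # we also consider p = 10 and p = 50
    pca <- prcomp(xi_train, center = TRUE)          # PCA decomposition of the training intensities
    x_train <- pca$x[, 1:p]                         # first p PC scores for the training images
    x_test  <- predict(pca, newdata = xi_test)[, 1:p]
    # Binary labels: -1 for an image of a zero, +1 otherwise
    y_train <- ifelse(digit_train == 0, -1, 1)
    y_test  <- ifelse(digit_test  == 0, -1, 1)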

6.2 Results

We present the timing results in Table 5 and the corresponding accuracy results in Table 6.
Table 5
Average computation times (in seconds) for the MNIST task from 10 repetitions

p    SMMH   SMMS   SMML   LIBSP   LIBLP   PEGASOS
10   0.16   0.22   0.22   0.31    0.35    0.01
20   0.49   0.72   0.71   0.63    1.21    0.02
50   3.66   5.87   5.84   2.77    4.37    0.05

Table 6
Average testing accuracies for the MNIST task from 10 repetitions

p    SMMH   SMMS   SMML   LIBSP   LIBLP   PEGASOS
10   0.96   0.97   0.97   0.97    0.97    0.52
20   0.97   0.97   0.97   0.98    0.99    0.49
50   0.97   0.98   0.98   0.99    0.99    0.48

From Table 5, we observe that the SMM algorithms are marginally faster than the batch algorithms when \(p=10\) and comparable to the batch algorithms when p is 20 or 50. The approximate hinge loss algorithm SMMH is faster than the other two SMM algorithms in all three cases of p. The PEGASOS algorithm is faster than all of the other algorithms by an order of magnitude in the \(p=10\) case and by two orders of magnitude when p is 20 or 50. We note that the better relative performance of the SMM algorithms in this task, as compared with their performance against the batch algorithms in Simulations 1 and 2, may be due to the lack of balance in the data set between the \(y_{i}=-1\) and \(y_{i}=1\) observations. Here, the ratio is approximately one to nine, whereas the ratio in the simulations is one to one.

From Table 6, we observe that the SMM algorithms are nearly as accurate as the two batch algorithms. Where they are not as accurate, the difference in accuracy is only 0.01 or 0.02. This is a good result, as it demonstrates that there is little to be lost from learning in a streamed environment in contrast to requiring the data to be available in batch. We note that the PEGASOS algorithm is once again significantly less accurate than all other algorithms. Here, PEGASOS, as implemented, performs worse than simply guessing in proportion to the ratio of the classes.

7 Conclusions

In modern data analysis, there is a need for the development of new algorithms that are operable within the Big Data context. One prevalent notion of Big Data is that it is defined via the three Vs: variety, velocity, and volume. Whereas variety must be considered when choosing a model for any particular task, the fitting of the model and conducting of said task are required to be amenable to high velocity data of large volume (e.g. streamed data that cannot be stored in memory simultaneously).

The SMM framework, as proposed by Razaviyayn et al. (2016), provides a useful paradigm for constructing algorithms that can cater to the analysis of large volumes of streamed data. Using the SMM framework, we have constructed three algorithms, SMMH, SMMS, and SMML, that solve the approximate hinge loss, approximate squared-hinge loss, and logistic loss SVM problems in the streamed data setting, respectively.

Using the theoretical results of Razaviyayn et al. (2016), we have validated that the three constructed algorithms are convergent under mild regularity conditions and that they converge to globally minimal solutions. Two simulation studies demonstrate that the SVMs obtained via the constructed SMM algorithms are comparable in accuracy to state-of-the-art batch algorithms from the LIBLINEAR package and also outperform the leading stream algorithm PEGASOS. With respect to timing, the SMM algorithms are found to be fast, but not as fast as the LIBLINEAR or PEGASOS algorithms. It is difficult to compare the SMM algorithms and the LIBLINEAR algorithms on speed, as they perform two similar but fundamentally different tasks. However, the difference compared with PEGASOS implies that there is a tradeoff between speed and accuracy when choosing between the SMM algorithms and PEGASOS.

Finally, we conducted an example analysis of the MNIST data set of LeCun (1998). Here, we again find that the SMM algorithms are comparable in accuracy to the LIBLINEAR algorithms and surpass the accuracy of PEGASOS. In this real-data setting, the SMM algorithms were also found to be more comparable in computational timing to the LIBLINEAR algorithms. PEGASOS is again faster than the SMM algorithms, although there is once more a tradeoff between accuracy and speed.

It can be seen that the SMM algorithms require more computation time than the LIBSP and LIBLP algorithms from LIBLINEAR, particularly when p is large. This is due to the careful and highly optimized use of trust region or line-search Newton-like or conjugate gradient methods within the two aforementioned algorithms (cf. Lin et al. 2008; Hsia et al. 2017). Unfortunately, the techniques that are implemented are specific to the use of the Newton-like or conjugate gradient methods for solving the squared-hinge and logistic loss SVM problems. Thus, the discussed techniques cannot be ported to work with our SMM algorithms in a trivial way.

Recently, works by Chouzenoux et al. (2011), Chouzenoux et al. (2013), and Chouzenoux and Pesquet (2017) have shown that trust region, line search, and Newton-like methods can be used within the MM and SMM frameworks to obtain computationally efficient and competitive solutions to various optimization problems from image and signal processing. Furthermore, generic coordinate-descent methodology for MM algorithms can also be utilized to obtain speedups and better scaling with p. Such methods include the use of the majorizer of De Pierro (1993) or the blockwise frameworks of Razaviyayn et al. (2013) and Mairal (2015).

These directions were not pursued within this article due to the numerous additional theoretical results that would be required to demonstrate the global convergence of such constructions. However, a result for a specific SMM solver of a least-squares regression problem was obtained by Chouzenoux and Pesquet (2017). This indicates that similar results for our SVM problems are also available. We defer the construction of these extended algorithms and the establishment of their global convergence for future work.


Acknowledgements

We thank the Associate Editor and Reviewer of the article for making helpful comments that greatly improved our exposition. HDN was supported by Australian Research Council (ARC) Grant DE170101134. GJM was supported by ARC Grant DP170100907.

References

  1. Abe, S. (2010). Support Vector Machines for Pattern Classification. London: Springer.
  2. Bohning, D., & Lindsay, B. R. (1988). Monotonicity of quadratic-approximation algorithms. Annals of the Institute of Statistical Mathematics, 40, 641–663.
  3. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge: Cambridge University Press.
  4. Cappe, O., & Moulines, E. (2009). On-line expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society B, 71, 593–613.
  5. Chouzenoux, E., Idier, J., & Moussaoui, S. (2011). A majorize-minimize strategy for subspace optimization applied to image restoration. IEEE Transactions on Image Processing, 20, 1517–1528.
  6. Chouzenoux, E., Jezierska, A., Pesquet, J.-C., & Talbot, H. (2013). A majorize-minimize subspace approach for \(l_2\)-\(l_0\) image regularization. SIAM Journal of Imaging Science, 6, 563–591.
  7. Chouzenoux, E., & Pesquet, J.-C. (2017). A stochastic majorize-minimize subspace algorithm for online penalized least squares estimation. IEEE Transactions on Signal Processing, 65, 4770–4783.
  8. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.
  9. De Pierro, A. R. (1993). On the relation between the ISRA and the EM algorithm for positron emission tomography. IEEE Transactions on Medical Imaging, 12, 328–333.
  10. Eddelbuettel, D. (2013). Seamless R and C++ Integration with Rcpp. New York: Springer.
  11. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. (2008). LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.
  12. Groenen, P. J. F., Nalbantov, G., & Bioch, J. C. (2008). SVM-Maj: a majorization approach to linear support vector machines with different hinge errors. Advances in Data Analysis and Classification, 2, 17–43.
  13. Helleputte, T. (2017). LiblineaR: Linear Predictive Models Based on the LIBLINEAR C/C++ Library.
  14. Hsia, C.-Y., Zhu, Y., & Lin, C.-J. (2017). A study on trust region update rules in Newton methods for large-scale linear classification. Proceedings of Machine Learning Research, 77, 33–48.
  15. Jolliffe, I. T. (2002). Principal Component Analysis. New York: Springer.
  16. Kim, S., Pasupathy, R., & Henderson, S. G. (2015). A guide to sample average approximation. In Handbook of Simulation Optimization (pp. 207–243). New York: Springer.
  17. Lange, K. (2016). MM Optimization Algorithms. Philadelphia: SIAM.
  18. LeCun, Y. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
  19. Lin, C.-J., Weng, R. C., & Keerthi, S. S. (2008). Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9, 627–650.
  20. Mairal, J. (2013). Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems (pp. 2283–2291).
  21. Mairal, J. (2015). Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal of Optimization, 25, 829–855.
  22. McAfee, A., Brynjolfsson, E., & Davenport, T. H. (2012). Big data: the management revolution. Harvard Business Review, 90, 60–68.
  23. Navia-Vasquez, A., Perez-Cruz, F., Artes-Rodriguez, A., & Figueiras-Vidal, A. R. (2001). Weighted least squares training of support vector classifiers leading to compact and adaptive schemes. IEEE Transactions on Neural Networks, 12, 1047–1059.
  24. Nguyen, H. D., & McLachlan, G. J. (2017). Iteratively-reweighted least-squares fitting of support vector machines: a majorization-minimization algorithm approach. In Proceedings of the 2017 Future Technologies Conference (FTC).
  25. Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7, 186–199.
  26. R Core Team (2016). R: a language and environment for statistical computing. R Foundation for Statistical Computing.
  27. Razaviyayn, M., Hong, M., & Luo, Z.-Q. (2013). A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal of Optimization, 23, 1126–1153.
  28. Razaviyayn, M., Sanjabi, M., & Luo, Z.-Q. (2016). A stochastic successive minimization method for nonsmooth nonconvex optimization with applications to transceiver design in wireless communication networks. Mathematical Programming Series B, 157, 515–545.
  29. Scholkopf, B., & Smola, A. J. (2002). Learning with Kernels. Cambridge: MIT Press.
  30. Shalev-Shwartz, S., Singer, Y., Srebro, N., & Cotter, A. (2011). Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming Series B, 127, 3–30.
  31. Shawe-Taylor, J., & Sun, S. (2011). A review of optimization methodologies in support vector machines. Neurocomputing, 74, 3609–3618.
  32. Steinwart, I., & Christmann, A. (2008). Support Vector Machines. New York: Springer.
  33. Titterington, D. M. (1984). Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society B, 46, 257–267.
  34. Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning.

Copyright information

© Japanese Federation of Statistical Science Associations 2018

Authors and Affiliations

  1. Department of Mathematics and Statistics, La Trobe University, Melbourne, Australia
  2. School of Mathematics and Physics, University of Queensland, St. Lucia, Australia
