# Stream-suitable optimization algorithms for some soft-margin support vector machine variants

## Abstract

Soft-margin support vector machines (SVMs) are an important class of classification models that are well known to be highly accurate in a variety of settings and over many applications. The training of SVMs usually requires that the data be available all at once, in batch. The stochastic majorization–minimization (SMM) algorithm framework allows for the training of SVMs on streamed data instead. We utilize the SMM framework to construct algorithms for training hinge loss, squared-hinge loss, and logistic loss SVMs. We prove that our three SMM algorithms are each convergent and demonstrate that the algorithms are comparable to some state-of-the-art SVM-training methods. An application to the famous MNIST data set is used to demonstrate the potential of our algorithms.

## Keywords

Big data, MNIST, Stochastic majorization–minimization algorithm, Streamed data, Support vector machines

## 1 Introduction

The soft-margin support vector machines, which we shall refer to as SVMs from here on, were first proposed in Cortes and Vapnik (1995). Ever since their introduction, SVMs have become a popular and powerful data analytic tool for conducting classification of labeled data at all scales. Some good texts that treat the various aspects of SVMs are Schölkopf and Smola (2002), Steinwart and Christmann (2008), and Abe (2010).

*a* is true and 0 otherwise. Under the setup above, we seek to determine a parameter vector \(\hat{\varvec{\theta }}\) that solves the optimization problem:

*Y* is unknown. That is, for some observed \(\varvec{x}\), we can make a prediction of *Y* via the classification rule \(\hat{y}={\text {sign}}(\tilde{\varvec{x}}^{\top }\hat{\varvec{\theta }})\), where \({\text {sign}}(a)=-1\) if \(a<0\), and \({\text {sign}}(a)=1\) otherwise.

*n* IID (independent and identically distributed) observations \(\{ \varvec{Z}_{i}\}\), where \(\varvec{Z}_{i}\sim \mathcal {L}_{\varvec{Z}}\) (\(i\in [n]=\{ 1,\dots ,n\}\)). We can then approximate Problem (1) by the average loss optimization problem:

Over the years, there have been many proposals for the choice of a surrogate loss function. The original SVM risk of Cortes and Vapnik (1995) utilized the hinge loss function \(l_{\text {H}}(\varvec{z};\varvec{\theta })=\left[ 1-y\tilde{\varvec{x}}^{\top }\varvec{\theta }\right] _{+}\), where \([a]_{+}=\max \{ 0,a\}\). In our related article, Nguyen and McLachlan (2017), we considered the squared-hinge loss \(l_{\text {S}}(\varvec{z};\varvec{\theta })=\left[ 1-y\tilde{\varvec{x}}^{\top }\varvec{\theta }\right] _{+}^{2}\), as well as the logistic loss \(l_{\text {L}}(\varvec{z};\varvec{\theta })=\log [1+\exp (-y\tilde{\varvec{x}}^{\top }\varvec{\theta })]\). Some other popular surrogate losses are suggested in Zhang (2004).

In Cortes and Vapnik (1995), it was found that any instance of Problem (3) with a hinge loss surrogate was a quadratic program and thus could be solved via any general quadratic programming solver. The literature now contains an abundance of methods for solving Problem (3) under various settings. A comprehensive review and critique of current solution techniques to Problem (3) appears in Shawe-Taylor and Sun (2011). Of particular relevance to the reader of this article, Navia-Vasquez et al. (2001), Groenen et al. (2008), and Nguyen and McLachlan (2017) all considered iteratively reweighted least-squares (IRLS) approaches for solving different variants of Problem (3) for batch data (static data that are available all at once).

The leading paradigm that defines the Big Data context is the notion of the three V’s (cf. McAfee et al. 2012), which stand for variety, velocity, and volume. Variety is generally addressed via the modeling of data and model choice, whereas velocity (the fact that data are not static and are accumulated over time) and volume (the fact that data are large in comparison to modern computing resources) require careful consideration of the manner in which models are fitted once they are chosen. In this article, we consider new algorithms for solving Problem (3) for various loss functions; the algorithms are designed to address the problems of having potentially high volume and high velocity data.

In Nguyen and McLachlan (2017), we showed that it was possible to construct IRLS algorithms for solving various instances of Problem (3) for batch data, using the majorization–minimization (MM) algorithm paradigm of Lange (2016). The constructed algorithms were demonstrated to exhibit convergent behavior, in the sense that the solution \(\hat{\varvec{\theta }}\) approaches the global minimizer of the respective problem instance as the number of iterations of each algorithm approaches infinity.

In addition to the qualitative advantages that we have described above, the newly constructed algorithms can also be proved to be globally convergent. That is, as the number of observations *n* in the data stream \(\left\{ \varvec{z}_{i}\right\}\) increases, each of the algorithms produces a solution \(\varvec{\theta }^{n}\) to Problem (4) that approaches the global minimizer of the problem with probability one. This is a useful guarantee that corresponds contextually to the convergence results that were obtained in Nguyen and McLachlan (2017).

We complement our theoretical results with some numerical simulations that display the typical performance of the constructed algorithms in various settings. As a demonstration, we also apply our algorithms to a classification problem involving the classic MNIST data of LeCun (1998).

The rest of the article proceeds as follows. A description of the SMM optimization framework proposed in this article is provided in “Stochastic MM algorithm”. In the third section, we derive the SMM algorithms for the addressed SVMs. In “Convergence analysis”, theoretical results are presented regarding the convergence of each of the algorithms. Numerical simulations are then provided in the fifth section. In the sixth section, we demonstrate the algorithms via applications to the MNIST data classification problem. Conclusions are then drawn in the final section.

## 2 Stochastic MM algorithm

The SMM algorithm that we present here is the one described by Razaviyayn et al. (2016). An alternative approach to the construction of stochastic MM-type optimization schemes was also considered by Mairal (2013). However, the approach of Mairal (2013) results in a somewhat more complicated set of iterations and convergence conditions than that of Razaviyayn et al. (2016). The SMM algorithm that is discussed is further connected to the stochastic expectation–maximization (EM) algorithm of Titterington (1984) and the online EM algorithm of Cappé and Moulines (2009). Each of these four frameworks is suitable for different settings and tasks.

Due to the potential lack of convexity of the function \(g_{1}\left( \varvec{\gamma };\varvec{w}\right)\) or lack of differentiability of the function \(g_{2}\left( \varvec{\gamma };\varvec{w}\right)\), the standard SAA approach may require a computationally intensive, and possibly iterative, procedure to compute a new solution \(\varvec{\gamma }^{n}\) upon the introduction of the \(n{\text {th}}\) observation. This results in an algorithmic scheme that requires iterations within iterations and is thus not suitable for the Big Data setting.

Suppose that we can find some function \(\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right)\) with the following properties: (A1) \(\tilde{g}_{1}\left( \varvec{\delta },\varvec{\delta };\varvec{w}\right) =g_{1}\left( \varvec{\delta };\varvec{w}\right)\) for all \(\varvec{\delta }\in \Gamma\) and \(\varvec{w}\in \mathbb {W}\), and (A2) \(\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) \ge g_{1}\left( \varvec{\gamma };\varvec{w}\right)\) for all \(\varvec{\gamma }\in \tilde{\Gamma }\), \(\varvec{\delta }\in \Gamma\), and \(\varvec{w}\in \mathbb {W}\). We call any function that satisfies (A1) and (A2) a majorizer of \(g_{1}\left( \varvec{\gamma };\varvec{w}\right)\). Here, the majorizer \(\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right)\) is assumed to be simpler to minimize than the function that it majorizes.
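For intuition, it may help to recall why a majorizer is useful in the ordinary (batch) MM setting. The sketch below is illustrative only: the symbol \(\varvec{\gamma }^{\text {new}}\), denoting a minimizer of the majorizer over \(\Gamma\) for a fixed \(\varvec{\delta }\), is introduced here and does not appear elsewhere in the article, and the stochastic scheme averages such surrogates over the stream rather than applying this step directly. The majorization property (A2), the definition of \(\varvec{\gamma }^{\text {new}}\), and the tangency property (A1) give, in that order, the descent chain:

```latex
g_{1}\left(\boldsymbol{\gamma}^{\text{new}};\boldsymbol{w}\right)
  \le \tilde{g}_{1}\left(\boldsymbol{\gamma}^{\text{new}},\boldsymbol{\delta};\boldsymbol{w}\right)
  \le \tilde{g}_{1}\left(\boldsymbol{\delta},\boldsymbol{\delta};\boldsymbol{w}\right)
  = g_{1}\left(\boldsymbol{\delta};\boldsymbol{w}\right).
```

Minimizing the simpler function therefore never increases the value of the function that it majorizes.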

## 3 SMM algorithms for SVMs

To construct the SMM algorithms for the hinge loss, squared-hinge loss, and logistic loss SVM optimization problems, we require the following majorizers. The derivations of these facts appear in Nguyen and McLachlan (2017) and Böhning and Lindsay (1988), and thus are omitted for brevity. Throughout this section, we use \(\varvec{\gamma }\) and \(\varvec{\theta }\) interchangeably, as well as \(\varvec{w}\) and \(\varvec{z}\).

### **Fact 1**

*For any* \(\epsilon >0\)*, the function* \(g\left( \varvec{\gamma };\varvec{w}\right) =\frac{1}{2}\sqrt{\left[ h\left( \varvec{\gamma };\varvec{w}\right) \right] ^{2}+\epsilon }+\frac{1}{2}h\left( \varvec{\gamma };\varvec{w}\right)\), *for any real-valued function* \(h\left( \varvec{\gamma };\varvec{w}\right)\), *can be majorized by*

*at any valid inputs* \(\varvec{\delta }\) *and* \(\varvec{w}\).
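One function that majorizes \(g\) in the sense of (A1) and (A2) can be obtained from the tangent-line bound \(\sqrt{u}\le \sqrt{u_{0}}+(u-u_{0})/(2\sqrt{u_{0}})\) for the concave square root. Whether this coincides with the authors' exact expression is an assumption. The sketch below checks the candidate numerically for the hinge-type choice \(h(\gamma )=1-\gamma\); the function names and the anchor point are hypothetical.

```python
import numpy as np

EPS = 1e-5  # smoothing constant; same value as used in the numerical section


def g(h):
    """Smoothed positive part: 0.5*sqrt(h^2 + eps) + 0.5*h."""
    return 0.5 * np.sqrt(h**2 + EPS) + 0.5 * h


def g_major(h, h_delta):
    """Candidate majorizer of g anchored at h_delta (assumed form)."""
    w = np.sqrt(h_delta**2 + EPS)
    return (h**2 + EPS) / (4.0 * w) + w / 4.0 + 0.5 * h


gamma = np.linspace(-3.0, 3.0, 601)
h = 1.0 - gamma          # hinge-type argument h(gamma) = 1 - gamma
h_delta = 1.0 - 0.4      # anchor point delta = 0.4 (arbitrary)

gap = g_major(h, h_delta) - g(h)                    # (A2): should be >= 0
touch = g_major(h_delta, h_delta) - g(h_delta)      # (A1): should be 0
print(gap.min(), touch)
```

Algebraically the gap equals \((\sqrt{u}-\sqrt{u_{0}})^{2}/(4\sqrt{u_{0}})\) with \(u=h(\gamma )^{2}+\epsilon\) and \(u_{0}=h(\delta )^{2}+\epsilon\), which is why it is nonnegative and vanishes at the anchor.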

### **Fact 2**

*Let* \(g\left( \varvec{\gamma };\varvec{w}\right)\) *be a real-valued function that is twice-differentiable in* \(\varvec{\gamma }\) *for each valid input* \(\varvec{w}\). *Let* \(\mathbf {H}\) *be a fixed matrix that bounds the Hessian of* \(g\left( \varvec{\gamma };\varvec{w}\right)\), *in the sense that* \(\mathbf {H}-\partial ^{2}g\left( \varvec{\gamma };\varvec{w}\right) /\partial \varvec{\gamma }\partial \varvec{\gamma }^{\top }\) *is positive definite, for all* \(\varvec{\gamma }\in \Gamma\) *and fixed* \(\varvec{w}\). *Then* \(g\left( \varvec{\gamma };\varvec{w}\right)\) *can be majorized by*

*at any valid inputs* \(\varvec{\delta }\) *and* \(\varvec{w}\).
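A standard bound of this kind is the quadratic upper bound \(g(\gamma )\le g(\delta )+g'(\delta )(\gamma -\delta )+\frac{1}{2}H(\gamma -\delta )^{2}\) with a constant curvature bound \(H\). It is sketched here as an assumption about how Fact 2 is used, not as the authors' expression. For the scalar logistic loss \(g(t)=\log [1+\exp (-t)]\) the second derivative never exceeds \(1/4\), so \(H=1/4\) suffices. The check below is numerical and the function names are hypothetical.

```python
import numpy as np


def logistic(t):
    """Logistic loss log(1 + exp(-t)), computed stably."""
    return np.logaddexp(0.0, -t)


def logistic_grad(t):
    """Derivative of the logistic loss with respect to t."""
    return -1.0 / (1.0 + np.exp(t))


def quad_major(t, t0, H=0.25):
    """Quadratic upper bound anchored at t0 with constant curvature H."""
    return logistic(t0) + logistic_grad(t0) * (t - t0) + 0.5 * H * (t - t0) ** 2


t = np.linspace(-10.0, 10.0, 2001)
t0 = 1.5  # arbitrary anchor point

gap = quad_major(t, t0) - logistic(t)        # should be >= 0 everywhere
touch = quad_major(t0, t0) - logistic(t0)    # should be 0 at the anchor
print(gap.min(), touch)
```

Because the curvature is constant, minimizing such a surrogate reduces to a least-squares problem with a fixed matrix.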

### 3.1 Hinge loss SMM

### 3.2 Squared-hinge loss SMM

The squared-hinge loss case \(n\text {th}\) step sub-problem for solving Problem (4) is obtained by making the substitutions \(g_{1}\left( \varvec{\theta };\varvec{z}\right) =l_{\text {S}}\left( \varvec{\theta };\varvec{z}\right)\) and \(g_{2}\left( \varvec{\theta };\varvec{z}\right) =\lambda \varvec{\beta }^{\top }\varvec{\beta }\). Although \(l_{\text {S}}\left( \varvec{\theta };\varvec{z}\right)\) is differentiable in \(\varvec{\theta }\) for any \(\varvec{z}\), it still does not meet the twice continuously differentiable criterion required by the SMM algorithm framework. We must thus devise an approximation for the squared-hinge loss that is suitable for our purpose.

Using the same identity as that which was used in “Hinge loss SMM”, we can write \(\left[ \gamma \right] _{+}^{2}=\left( \left| \gamma \right| /2+\gamma /2\right) ^{2}\), which can then be expanded to \(\left[ \gamma \right] _{+}^{2}=\gamma ^{2}/2+\gamma \left| \gamma \right| /2\). In this form, it is easy to see that for any small \(\epsilon >0\), we can approximate \(g\left( \gamma \right) =\left[ \gamma \right] _{+}^{2}\) by \(g_{\epsilon }\left( \gamma \right) =\left( \gamma ^{2}+\epsilon \right) /2+\gamma \sqrt{\gamma ^{2}+\epsilon }/2\).

Note the desirable property that \(g_{\epsilon }\left( \gamma \right) >0\) for any choice of \(\epsilon >0\) and for all \(\gamma \in \mathbb {R}\), since \(\lim _{\gamma \rightarrow -\infty }g_{\epsilon }\left( \gamma \right) =\epsilon /4\), \(\lim _{\gamma \rightarrow \infty }g_{\epsilon }\left( \gamma \right) =\infty\), and \({\text {d}}g_{\epsilon }\left( \gamma \right) /{\text {d}}\gamma =\left( \sqrt{\gamma ^{2}+\epsilon }+\gamma \right) ^{2}/\left( 2\sqrt{\gamma ^{2}+\epsilon }\right)\) is always positive.
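These properties are easy to check numerically. The following sketch (function names are illustrative and not taken from the article) evaluates \(g_{\epsilon }\) on a grid, compares it with \(\left[ \gamma \right] _{+}^{2}\), and confirms positivity of both the function and its derivative.

```python
import numpy as np


def sq_hinge(gamma):
    """Exact squared positive part [gamma]_+^2."""
    return np.maximum(gamma, 0.0) ** 2


def g_eps(gamma, eps):
    """Smooth approximation (gamma^2 + eps)/2 + gamma*sqrt(gamma^2 + eps)/2."""
    s = np.sqrt(gamma**2 + eps)
    return 0.5 * (gamma**2 + eps) + 0.5 * gamma * s


def g_eps_grad(gamma, eps):
    """Derivative (sqrt(gamma^2 + eps) + gamma)^2 / (2*sqrt(gamma^2 + eps))."""
    s = np.sqrt(gamma**2 + eps)
    return (s + gamma) ** 2 / (2.0 * s)


eps = 1e-5
gamma = np.linspace(-5.0, 5.0, 10001)

max_err = np.max(np.abs(g_eps(gamma, eps) - sq_hinge(gamma)))  # approximation error
min_val = g_eps(gamma, eps).min()                              # positivity
min_grad = g_eps_grad(gamma, eps).min()                        # monotonicity
print(max_err, min_val, min_grad)
```

On this grid the approximation error stays below \(\epsilon\), so shrinking \(\epsilon\) tightens the approximation uniformly.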

### 3.3 Logistic loss SMM

The logistic loss case \(n\text {th}\) step sub-problem for solving Problem (4) is obtained by making the substitutions \(g_{1}\left( \varvec{\theta };\varvec{z}\right) =l_{\text {L}}\left( \varvec{\theta };\varvec{z}\right)\) and \(g_{2}\left( \varvec{\theta };\varvec{z}\right) =\lambda \varvec{\beta }^{\top }\varvec{\beta }\). Fortunately, unlike the hinge and the squared-hinge loss, the logistic loss \(l_{\text {L}}\left( \varvec{\theta };\varvec{z}\right)\) is twice continuously differentiable and does not require approximation. We now seek to obtain a majorizer for \(g_{1}\left( \varvec{\theta };\varvec{z}\right)\).

### 3.4 Computational remarks

*n*, and at no other iteration if one saves the previous values of the sums given above (i.e., \(\tilde{\mathbf {Y}}_{n-1}^{\top }\varvec{\Omega }_{n-1}^{-1}\tilde{\mathbf {Y}}_{n-1}\) and \(\tilde{\mathbf {Y}}_{n-1}^{\top }\varvec{\Omega }_{n-1}^{-1}\left( \mathbf {1}_{n-1}+\varvec{\omega }_{n-1}\right)\)).

*n*.

Therefore, in all three of the algorithms above, one is not required to store the entire stream \(\left\{ \varvec{z}_{i}\right\}\) to compute the \(n\text {th}\) iteration of the algorithm. This is a significant memory and computational advantage when comparing the SMM algorithms to their batch counterparts.
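The storage argument can be made concrete with a short sketch. It assumes that the hinge-loss update solves \(\left( \tilde{\mathbf {Y}}_{n}^{\top }\varvec{\Omega }_{n}^{-1}\tilde{\mathbf {Y}}_{n}+4n\lambda \tilde{\mathbf {I}}_{p}\right) \varvec{\theta }=\tilde{\mathbf {Y}}_{n}^{\top }\varvec{\Omega }_{n}^{-1}\left( \mathbf {1}_{n}+\varvec{\omega }_{n}\right)\), a form inferred from the two sums named above and the smoothed hinge loss; the exact constants, the function name, and the demo settings are assumptions rather than the authors' code. Only a \((p+1)\times (p+1)\) matrix and a \((p+1)\)-vector are kept between observations.

```python
import numpy as np


def smm_hinge_stream(stream, p, lam, eps=1e-5):
    """Single pass over `stream` of (x, y) pairs with y in {-1, 1}.

    Assumed update (not quoted from the article):
    (A_n + 4*n*lam*I_tilde) theta = b_n, with A_n, b_n the running sums.
    """
    theta = np.zeros(p + 1)
    A = np.zeros((p + 1, p + 1))  # running sum of y~ y~^T / omega
    b = np.zeros(p + 1)           # running sum of y~ (1 + omega) / omega
    I_tilde = np.eye(p + 1)
    I_tilde[0, 0] = 0.0           # intercept left unpenalized (assumption)
    for n, (x, y) in enumerate(stream, start=1):
        y_tilde = y * np.concatenate(([1.0], x))
        # The weight depends on the previous iterate only, so earlier
        # observations never need to be revisited.
        omega = np.sqrt((1.0 - y_tilde @ theta) ** 2 + eps)
        A += np.outer(y_tilde, y_tilde) / omega  # rank-one update
        b += y_tilde * (1.0 + omega) / omega
        theta = np.linalg.solve(A + 4.0 * n * lam * I_tilde, b)
    return theta


# Small synthetic demo (hypothetical settings).
rng = np.random.default_rng(0)
N, p, delta = 5000, 5, 0.5
y = rng.choice([-1.0, 1.0], size=N)
X = rng.normal(size=(N, p)) + delta * y[:, None]

theta = smm_hinge_stream(zip(X, y), p=p, lam=1.0 / N)
acc = np.mean(np.sign(theta[0] + X @ theta[1:]) == y)
print(theta.shape, acc)
```

Each step costs one rank-one update plus one linear solve, so memory use does not grow with the length of the stream.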

## 4 Convergence analysis

### 4.1 General result

- (B1)
The function \(f\left( \varvec{\gamma }\right)\) is real valued and takes inputs \(\varvec{\gamma }\in \Gamma\), where \(\Gamma\) is a compact and convex set.

- (B2)
The function \(g_{1}\left( \varvec{\gamma };\varvec{w}\right)\) is twice continuously differentiable in \(\varvec{\gamma }\in \tilde{\Gamma }\), for each \(\varvec{w}\in \mathbb {W}\), where \(\tilde{\Gamma }\) is a bounded and open set such that \(\Gamma \subset \tilde{\Gamma }\).

- (B3)
The function \(g_{2}\left( \varvec{\gamma };\varvec{w}\right)\) is convex and continuous in \(\varvec{\gamma }\in \Gamma\), for each \(\varvec{w}\in \mathbb {W}\).

- (C1)
The function \(\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right)\) is real valued, takes inputs \(\varvec{\gamma }\in \tilde{\Gamma }\), and majorizes \(g_{1}\left( \varvec{\gamma };\varvec{w}\right)\) at \(\varvec{\delta }\in \tilde{\Gamma }\) and \(\varvec{w}\in \mathbb {W}\) in the sense that (A1) and (A2) are satisfied.

- (C2)
The function \(\tilde{g}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) =\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) +g_{2}\left( \varvec{\gamma };\varvec{w}\right)\) is uniformly strongly convex in \(\varvec{\gamma }\in \Gamma\), in the sense that for all valid \(\varvec{\gamma },\varvec{\delta }\in \Gamma\) and \(\varvec{w}\in \mathbb {W}\), there exists a constant \(\mu >0\), such that \(\tilde{g}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) -\frac{\mu }{2}\left( \varvec{\gamma }-\tilde{\varvec{\gamma }}\right) ^{\top }\left( \varvec{\gamma }-\tilde{\varvec{\gamma }}\right)\) is convex, for all \(\tilde{\varvec{\gamma }}\in \Gamma\) (cf. Mairal 2015).

- (D1)
The function \(\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right)\) is continuous in \(\varvec{\gamma }\in \tilde{\Gamma }\), for fixed \(\varvec{\delta }\in \tilde{\Gamma }\) and \(\varvec{w}\in \mathbb {W}\).

- (D2)
The functions \(g_{1}\left( \varvec{\gamma };\varvec{w}\right)\) and \(\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right)\) and their derivatives are uniformly bounded, in the sense that there exists a constant \(K_{1}>0\), such that \(\left| g_{1}\left( \varvec{\gamma };\varvec{w}\right) \right| \le K_{1}\), \(\left| \tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) \right| \le K_{1}\),
$$\begin{aligned}&\left\| \frac{\partial g_{1}\left( \varvec{\gamma };\varvec{w}\right) }{\partial \varvec{\gamma }}\right\| \le K_{1},\; \left\| \frac{\partial \tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) }{\partial \varvec{\gamma }}\right\| \le K_{1}\text {, and }\\&\left\| \frac{\partial ^{2}\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) }{\partial \varvec{\gamma }\partial \varvec{\gamma }^{\top }}\right\| \le K_{1}, \end{aligned}$$
for every combination of valid \(\varvec{\gamma },\varvec{\delta }\in \tilde{\Gamma }\) and \(\varvec{w}\in \mathbb {W}\).

- (D3)
The function \(g_{2}\left( \varvec{\gamma };\varvec{w}\right)\) and its directional derivative are uniformly bounded, in the sense that there exists a constant \(K_{2}>0\) such that \(\left| g_{2}\left( \varvec{\gamma };\varvec{w}\right) \right| \le K_{2}\) and \(\left| {\text {d}}_{\varvec{v}}g_{2}\left( \varvec{\gamma };\varvec{w}\right) \right| \le K_{2}\left\| \varvec{v}\right\|\), for all valid inputs \(\varvec{\gamma }\in \Gamma\) and \(\varvec{w}\in \mathbb {W}\), and valid directions \(\varvec{v}\in \mathbb {R}^{d}\) such that \(\varvec{\gamma }+\varvec{v}\in \Gamma\).

- (D4)
There exists a constant \(G\ge 0\) such that \(\left| \tilde{g}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right) \right| \le G\), for all valid \(\varvec{\gamma },\varvec{\delta }\in \Gamma\) and \(\varvec{w}\in \mathbb {W}\), where \(\tilde{g}\left( \varvec{\gamma },\varvec{\delta };\varvec{w}\right)\) is as defined in (C2).

### **Theorem 1**

*If assumptions (B1)–(B3), (C1), (C2), and (D1)–(D4) hold for functions* \(g_{1}\left( \varvec{\gamma };\varvec{z}\right)\), \(g_{2}\left( \varvec{\gamma };\varvec{z}\right)\), *and* \(\tilde{g}_{1}\left( \varvec{\gamma },\varvec{\delta };\varvec{z}\right)\), *then the sequence of iterates* \(\left\{ \varvec{\gamma }^{i}\right\}\) *of the SMM algorithm (Algorithm 1) converges to the set of stationary points of Problem* (5) *in the sense that*

*where* \(\Gamma ^{*}\) *is the set of stationary points of Problem* (5).

Theorem 1 provides a powerful convergence result that allows practitioners to be confident that the algorithm will produce meaningful results, provided enough data is made available through the input stream.

### 4.2 Application to the SMM algorithms for SVMs

We wish to apply Theorem 1 to conclude convergence results for our SMM algorithms for solving the approximate hinge loss, approximate squared-hinge loss, and the logistic loss SVMs that were derived in “SMM algorithms for SVMs”. Recall that \(\varvec{\gamma }\), \(\varvec{w}\), \(\Gamma\), and \(\mathbb {W}\) are interchangeable with \(\varvec{\theta }\), \(\varvec{z}\), \(\Theta\), and \(\mathbb {X}\times \left\{ -1,1\right\}\).

First, assumption (B1) can be satisfied in all cases by setting \(\Gamma \subset \mathbb {R}^{p+1}\) to be some hypercube \(\Theta =\left[ -a,a\right] ^{p+1}\) for some sufficiently large \(a>0\). We call this Assumption (B1*a*). Our approximate loss functions for the hinge and squared-hinge loss SVMs were constructed to satisfy (B2) and (B3), whereas the logistic loss function satisfies (B2) and (B3) naturally. There are further no issues with the convexity and continuity of \(g_{2}\left( \varvec{\theta };\varvec{z}\right) =\lambda \varvec{\beta }^{\top }\varvec{\beta }\), as it is simply a quadratic regularizer.

Assumption (C1) is satisfied for all three loss functions, as the surrogates \(\tilde{g}\left( \varvec{\theta },\varvec{\delta };\varvec{z}\right)\) are constructed to satisfy (A1) and (A2) in each case. Here, we can take \(\bar{\Theta }=\mathbb {R}^{p+1}\). Assumption (C2) is more difficult to verify, but it is close to being satisfied in all cases. Since all of the functions involved in the SMM algorithm are twice continuously differentiable, we can use the characterization that a function \(h\left( \varvec{\gamma }\right)\) taking inputs \(\varvec{\gamma }\in \Gamma\) is strongly convex if \(\partial ^{2}h\left( \varvec{\gamma }\right) /\partial \varvec{\gamma }\partial \varvec{\gamma }^{\top }-\mu \mathbf {I}_{p}\) is positive definite for some \(\mu >0\) (cf. Boyd and Vandenberghe 2004, Sect. 9.1.2). In other words, the smallest eigenvalue of the Hessian of \(h\left( \varvec{\gamma }\right)\) is lower bounded by the constant \(\mu\), for all \(\varvec{\gamma }\in \Gamma\). Given the previous definition, Assumption (C2) as applied to the approximate hinge loss, approximate squared-hinge loss, and logistic loss functions can be stated as: (C2H) the smallest eigenvalue of \(\frac{1}{2n}\tilde{\mathbf {Y}}_{n}^{\top }\varvec{\Omega }_{n}^{-1}\tilde{\mathbf {Y}}_{n}+2\lambda \tilde{\mathbf {I}}_{p}\) is lower bounded by some \(\mu >0\), (C2S) the smallest eigenvalue of \(\frac{2}{n}\tilde{\mathbf {Y}}_{n}^{\top }\tilde{\mathbf {Y}}_{n}+2\lambda \tilde{\mathbf {I}}_{p}\) is lower bounded by some \(\mu >0\), and (C2L) the smallest eigenvalue of \(\frac{1}{4n}\tilde{\mathbf {Y}}_{n}^{\top }\tilde{\mathbf {Y}}_{n}+2\lambda \tilde{\mathbf {I}}_{p}\) is lower bounded by some \(\mu >0\), respectively.
Note that although we must make the explicit assumptions (C2H), (C2L), or (C2S), for the respective loss functions, we can be very confident that they are satisfied, since if \(\tilde{\mathbf {I}}_{p}\) were to be replaced by \(\mathbf {I}_{p+1}\) in each of the Hessian expressions, we would have strong convexity in every case by setting \(\mu =2\lambda\).
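As a quick sanity check of a condition such as (C2S), one can compute the smallest eigenvalue of the stated matrix on simulated data. The sketch below assumes that the rows of \(\tilde{\mathbf {Y}}_{n}\) are \(y_{i}\tilde{\varvec{x}}_{i}^{\top }\) with \(\tilde{\varvec{x}}_{i}^{\top }=(1,\varvec{x}_{i}^{\top })\), and that \(\tilde{\mathbf {I}}_{p}\) is the identity with its leading (intercept) entry set to zero. Both are assumptions consistent with the penalty \(\lambda \varvec{\beta }^{\top }\varvec{\beta }\) but not spelled out in this section, and the data settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 1000, 5, 1e-3

y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, p))
Y_tilde = y[:, None] * np.hstack([np.ones((n, 1)), X])  # rows y_i * (1, x_i)

I_tilde = np.eye(p + 1)
I_tilde[0, 0] = 0.0  # intercept unpenalized (assumed)

# Matrix appearing in (C2S); eigvalsh is appropriate because M is symmetric.
M = (2.0 / n) * Y_tilde.T @ Y_tilde + 2.0 * lam * I_tilde
mu_min = np.linalg.eigvalsh(M).min()

# The penalty alone is singular, which is why (C2S) is stated as an assumption.
penalty_min = np.linalg.eigvalsh(2.0 * lam * I_tilde).min()
print(mu_min, penalty_min)
```

The data term supplies the curvature in the intercept direction that the penalty lacks, so in practice the smallest eigenvalue is positive once a few observations have arrived.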

Next, we can check that (D1) is fulfilled for all three loss functions by construction. Furthermore, since each of the surrogate functions is quadratic and thus smooth, (D2) is fulfilled if (B1*a*) is assumed. Similarly, since the penalty \(g_{2}\left( \varvec{\theta };\varvec{z}\right)\) is quadratic and thus smooth, (D3) is also fulfilled if (B1*a*) is assumed. Finally, since \(\tilde{g}_{1}\left( \varvec{\theta },\varvec{\delta };\varvec{z}\right)\) is continuous in every case and fulfills (D1), (D4) is automatically satisfied if (B1*a*) is assumed. We can, therefore, state the following convergence result regarding the three SMM algorithms.

### **Proposition 1**

*Under Assumptions (B1**a**), and (C2H), (C2L), or (C2S), the SMM algorithms for the SVM problems with approximate hinge loss, approximate squared-hinge loss, and logistic loss, as defined by the use of* \(n\text {th}\) *step iterates* \(\varvec{\theta }^{n}\leftarrow \varvec{\theta }_{\text {H}}^{n}\), \(\varvec{\theta }^{n}\leftarrow \varvec{\theta }_{\text {S}}^{n}\), *and* \(\varvec{\theta }^{n}\leftarrow \varvec{\theta }_{\text {L}}^{n}\) *in Algorithm 1 permit the conclusion of Theorem* 1 *with* \(\left\{ \varvec{\gamma }^{i}\right\}\) *replaced by the respective sequence of SMM iterates* \(\left\{ \varvec{\theta }^{i}\right\}\) *and* \(\Gamma ^{*}\) *replaced by the set of stationary points of the respective SVM problems* \(\Theta ^{*}\).

Let \(\tilde{\Theta }\) be a convex and open subset of \(\Theta\) under Assumption (B1*a*). Via Chebyshev’s inequality, and under the assumptions of Proposition 1, \(\tilde{f}_{n}\left( \varvec{\theta }\right)\) approaches \(f\left( \varvec{\theta }\right)\) in probability as \(n\rightarrow \infty\) for each \(\varvec{\theta }\in \tilde{\Theta }\) and for any choice of loss function that we have considered, where the regularization \(g_{2}\left( \varvec{\theta };\varvec{z}\right) =\lambda \varvec{\beta }^{\top }\varvec{\beta }\). Furthermore, for any choice of loss function along with the quadratic regularization term, we can observe that \(\tilde{f}_{n}\left( \varvec{\theta }\right)\) is convex, since it is the sum of positive convex functions. Application of the convexity lemma from Pollard (1991) then yields the following conclusion.

### **Corollary 1**

*Let* \(\tilde{\Theta }\) *be the interior of* \(\Theta\) *under (B1**a**). Under the assumptions of Proposition* 1, \(f\left( \varvec{\theta }\right)\) *is convex on* \(\tilde{\Theta }\) *for any loss function* \(g_{1}\left( \varvec{\theta };\varvec{z}\right)\) *that was considered in Proposition* 1 *and quadratic regularization* \(g_{2}\left( \varvec{\theta };\varvec{z}\right) =\lambda \varvec{\beta }^{\top }\varvec{\beta }\), *where* \(\lambda >0\). *Furthermore, since* \(f\left( \varvec{\theta }\right)\) *is convex, the set of stationary points* \(\Theta ^{*}\) *for each problem is therefore the set of global minimizers of the respective problem on the set* \(\tilde{\Theta }\).

## 5 Numerical simulations

The SMM algorithms that were described in “SMM algorithms for SVMs” are implemented in the *R* programming environment (R Core Team, 2016), with particularly computationally intensive loops programmed in *C* and integrated via the *Rcpp* package (Eddelbuettel, 2013). The implementations can be freely obtained at github.com/andrewthomasjones. All computations were conducted on a MacBook Pro with a 2.2 GHz Intel Core i7 processor, 16 GB of 1600 MHz DDR3 memory, and a 500 GB SSD. Computational times that are reported are obtained via the *proc.time()* function. Through prior experimentation, we have found that setting \(\epsilon =10^{-5}\) and \(\lambda =1/N\) yields good results in practice, where *N* is defined in the sequel. As such, for all of our numerical computations, these are the settings that we utilize.

### 5.1 Simulation 1

We sample streams of *N* observations \(\left\{ \varvec{z}_{i}\right\}\), where each \(\varvec{z}_{i}\) (\(i\in \left[ N\right]\)) is a realization of the random variable \(\varvec{Z}\sim \mathcal {L}_{\varvec{Z}}\). The law \(\mathcal {L}_{\varvec{Z}}\) is defined in the following hierarchical manner. First, \(Y\in \left\{ -1,1\right\}\) is generated with equal probability (i.e. \(\mathbb {P}\left( Y=-1\right) =\mathbb {P}\left( Y=1\right) =1/2\)). Next, conditional on \(Y=y\), \(\varvec{X}\) is generated from a \(p{\text {-dimensional}}\) multivariate Gaussian distribution with mean \(\Delta y\) and identity covariance matrix. The three factors that can be varied for the simulation are *N*, *p*, and \(\Delta\). Here, we choose to simulate the scenarios \(N\in \left\{ 1\times 10^{4},5\times 10^{4},1\times 10^{5}\right\}\), \(p\in \left\{ 5,10,20\right\}\), and \(\Delta \in \left\{ 0.125,0.25,0.5\right\}\).
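A minimal sketch of this sampling scheme follows. It assumes that "mean \(\Delta y\)" denotes the vector whose every component equals \(\Delta y\), an interpretation that is not stated explicitly in the text. Under that assumption the Bayes rule is \({\text {sign}}(\mathbf {1}^{\top }\varvec{x})\) with accuracy \(\Phi (\Delta \sqrt{p})\), where \(\Phi\) is the standard normal distribution function, which provides a ceiling against which classification accuracies can be read. Function names are illustrative.

```python
import math

import numpy as np


def simulate_stream(N, p, delta, seed=0):
    """Draw N labeled observations: Y uniform on {-1, 1}, X | Y=y ~ N(delta*y*1, I)."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=N)
    X = rng.normal(size=(N, p)) + delta * y[:, None]
    return X, y


def bayes_accuracy(p, delta):
    """Accuracy of the Bayes rule sign(sum(x)), i.e. Phi(delta * sqrt(p))."""
    z = delta * math.sqrt(p)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


X, y = simulate_stream(N=50_000, p=5, delta=0.5)
empirical = np.mean(np.sign(X.sum(axis=1)) == y)
theoretical = bayes_accuracy(p=5, delta=0.5)
print(empirical, theoretical)
```

For \(\Delta =0.5\) and \(p=5\) the formula evaluates to about 0.87, and the ceiling rises with both \(\Delta\) and \(p\).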

### 5.2 Comparisons

Each of our three SMM SVM algorithms is assessed based on two factors. First, they are assessed based on the computational time required to compute \(\hat{\varvec{\theta }}\), where \(\hat{\varvec{\theta }}\) is defined as the estimator of \(\varvec{\theta }\) for each of the SVM problems computed over a single sweep of the stream \(\left\{ \varvec{z}_{i}\right\}\). That is, each of the *N* observations from the stream \(\left\{ \varvec{z}_{i}\right\}\) is only accessed once.

Secondly, using a test set, we measure the accuracy of the classification rule \(\hat{y}={\text {sign}}\left( \tilde{\varvec{x}}^{\top }\hat{\varvec{\theta }}\right)\) for each of the SMM-fitted approximate hinge loss, approximate squared-hinge loss, and logistic loss SVMs. We shall refer to these three SVMs as SMMH, SMMS, and SMML, from here on. For each simulation scenario, the computational time and accuracy are measured over 10 repetitions each and then averaged to yield appropriately precise measures of comparison.

Along with the three SVM algorithms that are described in “SMM algorithms for SVMs”, we also assess performances of the hinge loss, squared-hinge loss, and logistic loss SVMs (with \(P\left( \varvec{\beta }\right) =\lambda \varvec{\beta }^{\top }\varvec{\beta }\)) as fitted by the *LIBLINEAR* package of solvers of Fan et al. (2008), applied via the *LiblineaR* package of Helleputte (2017). The hinge loss SVM can be fitted via a dual optimization routine (LIBHD), and the squared-hinge and logistic loss SVMs can be fitted via both dual and primal optimization routines (LIBSD and LIBSP, and LIBLD and LIBLP). We note that the five optimization routines for fitting the three different SVM varieties from the *LIBLINEAR* package are batch methods and require the entire data set to be maintained in storage contemporaneously. These algorithms are therefore not directly comparable to the SMM algorithms. We have included the *LIBLINEAR* algorithm results to provide a “gold-standard” benchmark against which we can compare our algorithms.

For a stream-suitable algorithm benchmark, we also compare our methods to the PEGASOS algorithm of Shalev-Shwartz et al. (2011) as applied in *R* via a modified implementation of the codes obtained from github.com/wrathematics. The PEGASOS algorithm solves the hinge loss SVM problem with stochastic sub-gradient descent and is run in its streamed form, where each \(\varvec{z}_{i}\) of \(\left\{ \varvec{z}_{i}\right\}\) is utilized once and in the order of its arrival. The algorithm is terminated upon having used up all *N* observations of the stream, so as to make it comparable to SMMH, SMMS, and SMML. Each of the *LIBLINEAR* algorithms and PEGASOS are also compared to the three methods from “SMM algorithms for SVMs” based on the average computational time required to compute \(\hat{\varvec{\theta }}\) and the accuracy of the constructed classifier.
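For orientation, the core PEGASOS update can be sketched as below. This is a simplified single-pass version written from the published description of the method (step size \(1/(\lambda t)\), no bias term, no optional projection step). It is not the modified implementation used for the comparisons, and the function name and demo settings are hypothetical.

```python
import numpy as np


def pegasos_stream(stream, p, lam):
    """One pass of stochastic sub-gradient descent for the hinge-loss SVM."""
    w = np.zeros(p)
    for t, (x, y) in enumerate(stream, start=1):
        eta = 1.0 / (lam * t)  # decaying step size
        if y * (w @ x) < 1.0:
            # Margin violated: shrink w and step toward y * x.
            w = (1.0 - eta * lam) * w + eta * y * x
        else:
            # Margin satisfied: only the regularizer contributes.
            w = (1.0 - eta * lam) * w
    return w


# Small synthetic demo (hypothetical settings, chosen for clear separation).
rng = np.random.default_rng(0)
N, p, delta = 20000, 5, 1.0
y = rng.choice([-1.0, 1.0], size=N)
X = rng.normal(size=(N, p)) + delta * y[:, None]

w = pegasos_stream(zip(X, y), p=p, lam=0.01)
acc = np.mean(np.sign(X @ w) == y)
print(acc)
```

Each observation costs \(O(p)\) work here, which is the main reason such a method runs much faster per observation than a scheme that solves a linear system at every step.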

### 5.3 Results of Simulation 1

Table 1 Average computation times (in seconds) for Simulation 1 from 10 repetitions

\(\Delta\) | *N* | *p* | LIBHD | LIBSD | LIBLD | SMMH | SMMS | SMML | LIBSP | LIBLP | PEGASOS |
---|---|---|---|---|---|---|---|---|---|---|---|
0.125 | 1.00E+04 | 5 | 0.88 | 5.73 | 0.71 | 0.16 | 0.17 | 0.17 | 0.00 | 0.00 | 0.00 |
0.125 | 1.00E+04 | 10 | 1.24 | 8.21 | 1.04 | 0.50 | 0.53 | 0.53 | 0.00 | 0.00 | 0.00 |
0.125 | 1.00E+04 | 20 | 1.92 | 13.47 | 1.55 | 1.92 | 2.08 | 2.07 | 0.01 | 0.01 | 0.00 |
0.125 | 5.00E+04 | 5 | 5.42 | 33.39 | 4.23 | 0.82 | 0.84 | 0.85 | 0.01 | 0.02 | 0.00 |
0.125 | 5.00E+04 | 10 | 8.24 | 48.45 | 5.98 | 2.63 | 2.71 | 2.70 | 0.03 | 0.03 | 0.01 |
0.125 | 5.00E+04 | 20 | 12.93 | 80.00 | 9.19 | 10.46 | 10.58 | 10.57 | 0.06 | 0.07 | 0.02 |
0.125 | 1.00E+05 | 5 | 13.39 | 74.88 | 10.16 | 1.71 | 1.72 | 1.71 | 0.04 | 0.05 | 0.01 |
0.125 | 1.00E+05 | 10 | 19.04 | 102.25 | 13.02 | 5.35 | 5.42 | 5.40 | 0.06 | 0.07 | 0.02 |
0.125 | 1.00E+05 | 20 | 28.00 | 170.27 | 19.40 | 21.06 | 20.97 | 20.96 | 0.11 | 0.14 | 0.04 |
0.25 | 1.00E+04 | 5 | 1.03 | 5.96 | 0.79 | 0.17 | 0.17 | 0.17 | 0.00 | 0.01 | 0.00 |
0.25 | 1.00E+04 | 10 | 1.31 | 7.37 | 1.04 | 0.53 | 0.55 | 0.55 | 0.00 | 0.01 | 0.00 |
0.25 | 1.00E+04 | 20 | 1.54 | 8.24 | 1.56 | 1.98 | 2.10 | 2.10 | 0.01 | 0.01 | 0.00 |
0.25 | 5.00E+04 | 5 | 6.15 | 39.91 | 5.00 | 0.86 | 0.85 | 0.85 | 0.02 | 0.03 | 0.01 |
0.25 | 5.00E+04 | 10 | 7.53 | 42.07 | 6.15 | 2.69 | 2.70 | 2.69 | 0.03 | 0.04 | 0.01 |
0.25 | 5.00E+04 | 20 | 8.75 | 45.42 | 8.46 | 10.33 | 10.61 | 10.61 | 0.06 | 0.08 | 0.02 |
0.25 | 1.00E+05 | 5 | 11.83 | 63.72 | 8.30 | 1.72 | 1.68 | 1.67 | 0.03 | 0.05 | 0.01 |
0.25 | 1.00E+05 | 10 | 14.89 | 80.21 | 11.48 | 5.32 | 5.15 | 5.16 | 0.06 | 0.07 | 0.02 |
0.25 | 1.00E+05 | 20 | 16.62 | 83.35 | 16.27 | 20.28 | 20.45 | 20.43 | 0.10 | 0.13 | 0.03 |
0.5 | 1.00E+04 | 5 | 0.74 | 3.07 | 0.64 | 0.17 | 0.16 | 0.16 | 0.00 | 0.00 | 0.00 |
0.5 | 1.00E+04 | 10 | 0.56 | 2.54 | 0.83 | 0.49 | 0.51 | 0.51 | 0.00 | 0.01 | 0.00 |
0.5 | 1.00E+04 | 20 | 0.47 | 0.96 | 1.10 | 1.87 | 2.06 | 2.05 | 0.01 | 0.01 | 0.00 |
0.5 | 5.00E+04 | 5 | 3.53 | 17.95 | 3.62 | 0.85 | 0.82 | 0.81 | 0.02 | 0.03 | 0.00 |
0.5 | 5.00E+04 | 10 | 2.96 | 14.49 | 4.96 | 2.59 | 2.56 | 2.56 | 0.03 | 0.05 | 0.01 |
0.5 | 5.00E+04 | 20 | 1.89 | 6.24 | 6.47 | 9.71 | 10.26 | 10.25 | 0.05 | 0.08 | 0.01 |
0.5 | 1.00E+05 | 5 | 7.63 | 40.03 | 8.01 | 1.71 | 1.63 | 1.62 | 0.04 | 0.05 | 0.01 |
0.5 | 1.00E+05 | 10 | 6.44 | 29.33 | 10.29 | 5.12 | 5.19 | 5.18 | 0.06 | 0.09 | 0.01 |
0.5 | 1.00E+05 | 20 | 4.22 | 13.80 | 13.82 | 19.59 | 20.52 | 20.53 | 0.11 | 0.16 | 0.03 |

Table 2 Average training accuracies for Simulation 1 from 10 repetitions

\(\Delta\) | *N* | *p* | LIBHD | LIBSD | LIBLD | SMMH | SMMS | SMML | LIBSP | LIBLP | PEGASOS |
---|---|---|---|---|---|---|---|---|---|---|---|
0.125 | 1.00E+04 | 5 | 0.61 | 0.61 | 0.61 | 0.59 | 0.61 | 0.61 | 0.61 | 0.61 | 0.56 |
0.125 | 1.00E+04 | 10 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.59 |
0.125 | 1.00E+04 | 20 | 0.71 | 0.71 | 0.71 | 0.70 | 0.71 | 0.71 | 0.71 | 0.71 | 0.63 |
0.125 | 5.00E+04 | 5 | 0.61 | 0.61 | 0.61 | 0.61 | 0.61 | 0.61 | 0.61 | 0.61 | 0.54 |
0.125 | 5.00E+04 | 10 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.56 |
0.125 | 5.00E+04 | 20 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.62 |
0.125 | 1.00E+05 | 5 | 0.61 | 0.61 | 0.61 | 0.61 | 0.61 | 0.61 | 0.61 | 0.61 | 0.54 |
0.125 | 1.00E+05 | 10 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.57 |
0.125 | 1.00E+05 | 20 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.63 |
0.25 | 1.00E+04 | 5 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.63 |
0.25 | 1.00E+04 | 10 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.70 |
0.25 | 1.00E+04 | 20 | 0.87 | 0.87 | 0.87 | 0.86 | 0.87 | 0.87 | 0.87 | 0.87 | 0.81 |
0.25 | 5.00E+04 | 5 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.63 |
0.25 | 5.00E+04 | 10 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.73 |
0.25 | 5.00E+04 | 20 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.82 |
0.25 | 1.00E+05 | 5 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.65 |
0.25 | 1.00E+05 | 10 | 0.79 | 0.79 | 0.79 | 0.78 | 0.79 | 0.79 | 0.79 | 0.79 | 0.71 |
0.25 | 1.00E+05 | 20 | 0.87 | 0.87 | 0.87 | 0.86 | 0.87 | 0.87 | 0.87 | 0.87 | 0.82 |
0.5 | 1.00E+04 | 5 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.79 |
0.5 | 1.00E+04 | 10 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.91 |
0.5 | 1.00E+04 | 20 | 0.99 | 0.99 | 0.99 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 0.98 |
0.5 | 5.00E+04 | 5 | 0.87 | 0.87 | 0.87 | 0.86 | 0.87 | 0.87 | 0.87 | 0.87 | 0.82 |
0.5 | 5.00E+04 | 10 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.90 |
0.5 | 5.00E+04 | 20 | 0.99 | 0.99 | 0.99 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 0.98 |
0.5 | 1.00E+05 | 5 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.81 |
0.5 | 1.00E+05 | 10 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.92 |
0.5 | 1.00E+05 | 20 | 0.99 | 0.99 | 0.99 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 0.98 |

From Table 1, we observe that the computational times of the SMM algorithms increase with both *N* and *p*. Increases in *N* tend to yield a linear increase in computational time, whereas increases in *p* yield nonlinear increases. Given each of the expressions (10), (14), and (17), we can write the time complexity of the SMM algorithms as \(O\left( np^{3}\right)\), which is consistent with our observations. We notice that the SMM algorithms are not at all affected by the class separation \(\Delta\). All three of the SMM algorithms require approximately the same amount of computational time in all cases.
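As a rough check of these scaling statements, the following sketch estimates empirical growth exponents from the SMMH timings reported in Table 1 for \(\Delta =0.5\). The helper name `log_slope` is ours, and only the two end points of each series are used, so the estimates are indicative at best.

```python
import math

def log_slope(x0, t0, x1, t1):
    """Slope of log(time) against log(size) between two measurements."""
    return (math.log(t1) - math.log(t0)) / (math.log(x1) - math.log(x0))

# SMMH timings (seconds) from Table 1 with Delta = 0.5.
time_by_n = {1e4: 0.17, 5e4: 0.85, 1e5: 1.71}   # p = 5 held fixed
time_by_p = {5: 1.71, 10: 5.12, 20: 19.59}      # N = 1e5 held fixed

n_exponent = log_slope(1e4, time_by_n[1e4], 1e5, time_by_n[1e5])
p_exponent = log_slope(5, time_by_p[5], 20, time_by_p[20])

print(f"exponent in N: {n_exponent:.2f}")
print(f"exponent in p: {p_exponent:.2f}")
```

The exponent in *N* comes out close to one, in line with linear growth. The exponent in *p* is above one but below three over this small range of *p*, which does not conflict with the \(O\left( np^{3}\right)\) statement, since that expression is an upper bound.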

In contrast, we observe that the dual algorithms LIBHD, LIBSD, and LIBLD have computational times that decrease as \(\Delta\) increases. These algorithms also exhibit increasing computational time as *N* and *p* increase, although there are some irregularities, such as computational times decreasing as *p* increases for fixed *N* and \(\Delta\). This occurs in the cases where \(\Delta =0.5\). Other than these effects, we notice that, for \(\Delta =0.5\), the SMM algorithms are comparable to the dual algorithms when *p* is 5 or 10 and are slower when \(p=20\). For \(\Delta =0.25\), the same statement holds except when compared to LIBSD, which is much slower than all other algorithms in all cases of *p* and *N*. For \(\Delta =0.125\), we observe that the SMM algorithms are faster than their dual counterparts except in the case of \(N=10{,}000\) and \(p=20\), where the LIBHD and LIBLD algorithms are faster by a small amount. The LIBSD algorithm is also substantially slower than the other algorithms in all cases here.

The three algorithms LIBSP, LIBLP, and PEGASOS are all multiple orders of magnitude faster than the SMM algorithms in all cases. This is likely due to their conjugate gradient and coordinate-descent forms, and optimized implementations, rather than the Newton-like iterations and ad-hoc implementations of our SMM algorithms.

From Table 2, we observe that the accuracy of the SMM algorithms is in most cases equal to the accuracy of the five batch algorithms. This is very surprising, as the SMM algorithms are only permitted to inspect each observation from the stream \(\left\{ \varvec{z}_{i}\right\}\) once, whereas the batch algorithms are permitted as many inspections of the observations as are required for convergence. Only the SMMH algorithm is less accurate than the batch algorithms in some cases. In situations where it is less accurate, the deficit is only 0.01 or 0.02, which we find tolerable given the streaming nature of the algorithm. The PEGASOS algorithm is substantially less accurate than all of the other algorithms when implemented in its streaming configuration. Given this fact, one is faced with a tradeoff between speed and accuracy when comparing PEGASOS to the SMM algorithms.
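To make the single-pass constraint concrete, here is a minimal sketch of a Pegasos-style sub-gradient update (Shalev-Shwartz et al. 2011) that visits each streamed observation exactly once. It is not the implementation benchmarked above: the function names, the regularization value `lam`, the toy stream, and the omission of an intercept and of the optional projection step are all our assumptions.

```python
def pegasos_single_pass(stream, dim, lam=0.01):
    """One pass of Pegasos-style updates over (x, y) pairs with y in {-1, +1}."""
    w = [0.0] * dim
    for t, (x, y) in enumerate(stream, start=1):
        eta = 1.0 / (lam * t)                      # step size 1 / (lambda * t)
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        shrink = 1.0 - eta * lam                   # effect of the L2 penalty
        if margin < 1.0:
            # Hinge loss is active: shrink, then move towards y * x.
            w = [shrink * wj + eta * y * xj for wj, xj in zip(w, x)]
        else:
            # Hinge loss is inactive: only the shrinkage applies.
            w = [shrink * wj for wj in w]
    return w

def predict(w, x):
    """Classification rule sign(x^T w), with sign(0) = 1."""
    return -1 if sum(wj * xj for wj, xj in zip(w, x)) < 0 else 1

# A tiny, hypothetical stream: each observation is seen once and then discarded.
stream = [([1.0, 0.0], 1), ([-1.0, 0.0], -1), ([2.0, 1.0], 1)]
w = pegasos_single_pass(iter(stream), dim=2, lam=1.0)
print(w, [predict(w, x) for x, _ in stream])
```

Because the stream is consumed through an iterator, no observation can be revisited, which is the restriction under which the streamed algorithms are compared with the batch solvers.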

### 5.4 Simulation 2

In this second set of simulations, we consider simulations of some larger data sets. The simulations are conducted as described in “Simulation 1”. However, we now choose to simulate the scenarios \(N\in \left\{ 5\times 10^{5},1\times 10^{6},5\times 10^{6}\right\}\), \(p\in \left\{ 10,20,50\right\}\), and \(\Delta \in \left\{ 0.125,0.25,0.5\right\}\).
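The scenario grid for this simulation can be enumerated as in the sketch below. The variable names are ours, and the data-generating step itself (described in "Simulation 1") is deliberately left out.

```python
from itertools import product

sample_sizes = [5 * 10**5, 1 * 10**6, 5 * 10**6]   # N
dimensions = [10, 20, 50]                          # p
separations = [0.125, 0.25, 0.5]                   # Delta

# Ordering chosen so that Delta varies slowest and p fastest.
scenarios = list(product(separations, sample_sizes, dimensions))

for delta, n_obs, p in scenarios[:3]:
    print(f"Delta={delta}, N={n_obs:.2E}, p={p}")
print(len(scenarios), "scenarios in total")
```

Each of the resulting combinations of \(\Delta\), *N*, and *p* corresponds to one row of the result tables for this simulation.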

### 5.5 Results of Simulation 2

Table 3. Average computation times (in seconds) for Simulation 2 from 10 repetitions

\(\Delta\) | *N* | *p* | SMMH | SMMS | SMML | LIBSP | LIBLP | PEGASOS |
---|---|---|---|---|---|---|---|---|

0.125 | 5.00E+05 | 10 | 1.42 | 1.91 | 1.93 | 0.33 | 0.33 | 0.05 |

0.125 | 5.00E+05 | 20 | 4.43 | 6.49 | 6.49 | 0.64 | 0.76 | 0.10 |

0.125 | 5.00E+05 | 50 | 30.77 | 49.03 | 49.02 | 1.45 | 1.73 | 0.17 |

0.125 | 1.00E+06 | 10 | 2.59 | 3.51 | 3.51 | 0.66 | 0.77 | 0.09 |

0.125 | 1.00E+06 | 20 | 8.76 | 12.57 | 12.57 | 1.58 | 2.00 | 0.16 |

0.125 | 1.00E+06 | 50 | 64.78 | 103.06 | 102.54 | 3.67 | 4.44 | 0.40 |

0.125 | 5.00E+06 | 10 | 12.81 | 17.66 | 17.66 | 3.41 | 3.85 | 0.46 |

0.125 | 5.00E+06 | 20 | 43.47 | 61.97 | 62.28 | 8.66 | 9.36 | 0.80 |

0.125 | 5.00E+06 | 50 | 320.11 | 510.16 | 512.75 | 27.75 | 27.48 | 2.02 |

0.25 | 5.00E+05 | 10 | 1.39 | 1.90 | 1.88 | 0.52 | 0.59 | 0.05 |

0.25 | 5.00E+05 | 20 | 4.36 | 6.22 | 6.26 | 0.87 | 1.10 | 0.07 |

0.25 | 5.00E+05 | 50 | 32.16 | 51.67 | 51.69 | 2.13 | 2.66 | 0.19 |

0.25 | 1.00E+06 | 10 | 2.77 | 3.79 | 3.79 | 1.15 | 1.25 | 0.10 |

0.25 | 1.00E+06 | 20 | 8.70 | 12.46 | 12.47 | 1.80 | 2.18 | 0.15 |

0.25 | 1.00E+06 | 50 | 61.34 | 97.55 | 96.89 | 3.17 | 3.98 | 0.35 |

0.25 | 5.00E+06 | 10 | 12.68 | 17.46 | 17.38 | 4.51 | 5.15 | 0.44 |

0.25 | 5.00E+06 | 20 | 39.98 | 57.60 | 57.98 | 6.71 | 9.10 | 0.73 |

0.25 | 5.00E+06 | 50 | 299.10 | 477.24 | 477.83 | 24.80 | 23.16 | 1.74 |

0.5 | 5.00E+05 | 10 | 1.28 | 1.77 | 1.74 | 0.33 | 0.48 | 0.04 |

0.5 | 5.00E+05 | 20 | 4.02 | 5.75 | 5.79 | 0.64 | 1.05 | 0.07 |

0.5 | 5.00E+05 | 50 | 29.91 | 47.71 | 47.69 | 1.61 | 2.05 | 0.16 |

0.5 | 1.00E+06 | 10 | 2.55 | 3.47 | 3.46 | 0.83 | 1.17 | 0.08 |

0.5 | 1.00E+06 | 20 | 8.01 | 11.53 | 11.57 | 1.39 | 2.05 | 0.14 |

0.5 | 1.00E+06 | 50 | 59.90 | 95.64 | 95.52 | 2.72 | 4.16 | 0.34 |

0.5 | 5.00E+06 | 10 | 12.62 | 17.33 | 17.34 | 4.22 | 5.80 | 0.42 |

0.5 | 5.00E+06 | 20 | 43.07 | 62.35 | 61.80 | 7.90 | 10.96 | 0.70 |

0.5 | 5.00E+06 | 50 | 319.80 | 504.86 | 503.23 | 24.23 | 31.03 | 1.88 |

Table 4. Average training accuracies for Simulation 2 from 10 repetitions

\(\Delta\) | *N* | *p* | SMMH | SMMS | SMML | LIBSP | LIBLP | PEGASOS |
---|---|---|---|---|---|---|---|---|

0.125 | 5.00E+05 | 10 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.56 |

0.125 | 5.00E+05 | 20 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.62 |

0.125 | 5.00E+05 | 50 | 0.81 | 0.81 | 0.81 | 0.81 | 0.81 | 0.74 |

0.125 | 1.00E+06 | 10 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.57 |

0.125 | 1.00E+06 | 20 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.62 |

0.125 | 1.00E+06 | 50 | 0.81 | 0.81 | 0.81 | 0.81 | 0.81 | 0.74 |

0.125 | 5.00E+06 | 10 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.56 |

0.125 | 5.00E+06 | 20 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.64 |

0.125 | 5.00E+06 | 50 | 0.81 | 0.81 | 0.81 | 0.81 | 0.81 | 0.74 |

0.25 | 5.00E+05 | 10 | 0.78 | 0.79 | 0.79 | 0.79 | 0.79 | 0.71 |

0.25 | 5.00E+05 | 20 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.81 |

0.25 | 5.00E+05 | 50 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.94 |

0.25 | 1.00E+06 | 10 | 0.78 | 0.79 | 0.79 | 0.79 | 0.79 | 0.72 |

0.25 | 1.00E+06 | 20 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.81 |

0.25 | 1.00E+06 | 50 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.94 |

0.25 | 5.00E+06 | 10 | 0.78 | 0.79 | 0.79 | 0.79 | 0.79 | 0.70 |

0.25 | 5.00E+06 | 20 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.82 |

0.25 | 5.00E+06 | 50 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.94 |

0.5 | 5.00E+05 | 10 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.92 |

0.5 | 5.00E+05 | 20 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.98 |

0.5 | 5.00E+05 | 50 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |

0.5 | 1.00E+06 | 10 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.90 |

0.5 | 1.00E+06 | 20 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.98 |

0.5 | 1.00E+06 | 50 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |

0.5 | 5.00E+06 | 10 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.91 |

0.5 | 5.00E+06 | 20 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.98 |

0.5 | 5.00E+06 | 50 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |

From Table 3, we notice the same relationships between computational time and *N*, *p*, and \(\Delta\) as in Simulation 1. We also notice that the SMMH algorithm is faster than the SMMS and SMML algorithms in all cases. Upon inspection, it appears that the SMM algorithms are approximately between one and two orders of magnitude slower than the batch algorithms LIBSP and LIBLP in each scenario. The PEGASOS algorithm is then another order of magnitude faster than the batch algorithms.
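The size of these gaps can be quantified for a single scenario. The sketch below takes the Table 3 row for \(\Delta =0.125\), \(N=5\times 10^{6}\), and \(p=50\), and reports the base-10 logarithm of the timing ratios. The helper name is ours, and other rows give different ratios.

```python
import math

# Average times (seconds) from Table 3 for Delta = 0.125, N = 5.00E+06, p = 50.
times = {"SMMH": 320.11, "SMMS": 510.16, "SMML": 512.75,
         "LIBSP": 27.75, "LIBLP": 27.48, "PEGASOS": 2.02}

def orders_of_magnitude(slow, fast):
    """Base-10 logarithm of the ratio of two average timings."""
    return math.log10(times[slow] / times[fast])

for smm in ("SMMH", "SMMS", "SMML"):
    print(f"{smm} vs LIBSP: {orders_of_magnitude(smm, 'LIBSP'):.2f}")
print(f"LIBSP vs PEGASOS: {orders_of_magnitude('LIBSP', 'PEGASOS'):.2f}")
```

For this scenario, each gap is somewhat more than one order of magnitude.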

From Table 4, we observe again, as with Simulation 1, that the SMM algorithms are nearly always equal in accuracy to the batch algorithms. The only exception is that the SMMH algorithm has an accuracy deficit of 0.01 on some occasions. The PEGASOS algorithm is once again significantly less accurate than the other algorithms. Thus, we are once more faced with a choice between speed and accuracy when choosing between the PEGASOS and SMM algorithms.

We finally remark that comparisons between the computational times of the SMM algorithms and the batch algorithms are made only for the sake of benchmarking and may be misleading if interpreted from a practical point of view. This is because the two classes of algorithms are constructed to perform two seemingly similar but fundamentally different learning tasks. The computational time per iteration of the SMM algorithms would be a more practically meaningful index of performance, as it is more congruous with the intended application setting of said algorithms.
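A per-iteration figure of this kind can be approximated from the reported totals. The sketch below divides the SMMH times in Table 3 (\(\Delta =0.125\), \(p=50\)) by *N*, under the assumption that one SMM iteration is performed per streamed observation and that fixed overheads are negligible. The function name is ours.

```python
# SMMH average times (seconds) from Table 3 for Delta = 0.125 and p = 50.
smmh_total_seconds = {5 * 10**5: 30.77, 1 * 10**6: 64.78, 5 * 10**6: 320.11}

def microseconds_per_observation(total_seconds, n_obs):
    """Average cost of one iteration, assuming one iteration per observation."""
    return 1e6 * total_seconds / n_obs

for n_obs, total in smmh_total_seconds.items():
    per_obs = microseconds_per_observation(total, n_obs)
    print(f"N={n_obs:.2E}: {per_obs:.1f} microseconds per observation")
```

Under these assumptions the cost per observation is roughly constant in *N*, at a few tens of microseconds for \(p=50\), which is what linear scaling in *N* implies.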

### *Remark 1*

From the joint results of the two simulations (Simulations 1 and 2), we observe that SMMH is generally markedly faster than SMMS and SMML. However, the additional speed of the SMMH algorithm appears to come at a small cost in accuracy, as we observe that SMMS and SMML yield slightly higher accuracy than SMMH in a small number of scenarios. Thus, in applications, we recommend the use of SMMH when speed is the only imperative, whereas SMMS and SMML may be more appropriate when a balance between speed and accuracy is required.

### *Remark 2*

We can make the following recommendations from the above comparisons between the algorithms from the *LIBLINEAR* package, the SMM algorithms, and PEGASOS. First, it is best not to use LIBHD, LIBSD, and LIBLD in any circumstance, as they require excessive computational time without providing better accuracy in return, when compared to the other methods. Second, for batch data, the LIBSP and LIBLP algorithms are preferred to the SMM algorithms, because they have been highly optimized for operation in such circumstances and are both fast and accurate when applied to large batch samples. Next, the SMM algorithms are able to achieve nearly identical accuracy levels to LIBSP and LIBLP while operating on streamed data. The algorithms are, however, slower when the data are acquired in batch. Finally, if computational time is the only imperative, then PEGASOS should be preferred to all of the other algorithms, whether the data are acquired in batch or streamed. However, PEGASOS is significantly less accurate than the other methods, especially when class separation and sample size are small.

## 6 MNIST data analysis

We seek to use the training set \(\left\{ \varvec{\zeta }_{i}\right\} _{\text {Train}}\) to construct an SVM classifier that can accurately distinguish zero from nonzero images from the test set \(\left\{ \varvec{\zeta }_{i}\right\} _{\text {Test}}\). We shall compare the performance of the three SMM algorithms along with the LIBSP, LIBLP, and PEGASOS algorithms on this task. The performance indicators are the average test set accuracy from 10 repetitions of the algorithms, from a random ordering of the training set, and the average computational time in seconds.

### 6.1 Preprocessing

Before progressing with our analyses, we first reduce the dimensionality of our training and testing data sets. Using the training data \(\left\{ \varvec{\zeta }_{i}\right\} _{\text {Train}}\), we perform a principal component analysis (PCA; see for example Jolliffe 2002) decomposition of the raw intensities \(\varvec{\xi }_{i}\) to yield the principal component (PC) features \(\varvec{x}_{i}\in \mathbb {R}^{p}\), where \(\varvec{x}_{i}\) contains the first *p* PCs. We select \(p\in \left\{ 10,20,50\right\}\) for our experiments to construct the preprocessed training and test sets \(\left\{ \varvec{z}_{i}\right\} _{\text {Train}}\) and \(\left\{ \varvec{z}_{i}\right\} _{\text {Test}}\), where \(\varvec{z}_{i}^{\top }=\left( y_{i},\varvec{x}_{i}^{\top }\right)\). All constructions of classifiers and reporting of classifier performances are based on the use of these preprocessed data.
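The preprocessing step can be sketched as follows, using NumPy's singular value decomposition. This is an illustration only, not the code used for the reported results: the function names are ours, random numbers stand in for the raw intensities (MNIST images have 784 pixels), and we assume the test set is projected with the mean and axes fitted on the training set.

```python
import numpy as np

def fit_pca(train_raw, p):
    """Return the training mean and the first p principal axes (as columns)."""
    mean = train_raw.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(train_raw - mean, full_matrices=False)
    return mean, vt[:p].T

def project(raw, mean, axes):
    """Map raw intensities to PC features using the training-set fit."""
    return (raw - mean) @ axes

# Hypothetical stand-in data with the MNIST feature dimension.
rng = np.random.default_rng(0)
train_raw = rng.random((200, 784))
test_raw = rng.random((50, 784))

mean, axes = fit_pca(train_raw, p=10)
x_train = project(train_raw, mean, axes)
x_test = project(test_raw, mean, axes)   # reuses the training mean and axes
print(x_train.shape, x_test.shape)
```

Fitting on the training set alone keeps the test set untouched until evaluation. Changing `p` to 20 or 50 gives the other two feature sets.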

### 6.2 Results

Table 5. Average computation times (in seconds) for the MNIST task from 10 repetitions

*p* | SMMH | SMMS | SMML | LIBSP | LIBLP | PEGASOS |
---|---|---|---|---|---|---|

10 | 0.16 | 0.22 | 0.22 | 0.31 | 0.35 | 0.01 |

20 | 0.49 | 0.72 | 0.71 | 0.63 | 1.21 | 0.02 |

50 | 3.66 | 5.87 | 5.84 | 2.77 | 4.37 | 0.05 |

Table 6. Average testing accuracies for the MNIST task from 10 repetitions

*p* | SMMH | SMMS | SMML | LIBSP | LIBLP | PEGASOS |
---|---|---|---|---|---|---|

10 | 0.96 | 0.97 | 0.97 | 0.97 | 0.97 | 0.52 |

20 | 0.97 | 0.97 | 0.97 | 0.98 | 0.99 | 0.49 |

50 | 0.97 | 0.98 | 0.98 | 0.99 | 0.99 | 0.48 |

From Table 5, we observe that the SMM algorithms are marginally faster than the batch algorithms when \(p=10\) and comparable to the batch algorithms when *p* is 20 or 50. The approximate hinge loss algorithm SMMH is faster than the other two SMM algorithms in all three cases of *p*. The PEGASOS algorithm is faster than all of the other algorithms by an order of magnitude in the \(p=10\) case and two orders of magnitude when *p* is 20 or 50. We note that the better relative performance of the SMM algorithms in this task, when compared with Simulations 1 and 2, may be due to the lack of balance in the data set between \(y_{i}=-1\) and \(y_{i}=1\) observations. Here, the ratio is approximately one to nine, whereas the ratio in the simulations is one to one.

From Table 6, we observe that the SMM algorithms are nearly as accurate as the two batch algorithms. Where they are not as accurate, the difference in accuracy is only 0.01 or 0.02. This is a good result, as it demonstrates that there is little to be lost from learning in a streamed environment in contrast to requiring the data be available in batch. We note that the PEGASOS algorithm is once again significantly less accurate than all other algorithms. Here, PEGASOS, as implemented, performs worse than simply guessing in proportion to the ratio of the classes.
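The comparison with proportional guessing can be made explicit. The sketch below assumes the approximate one-to-nine class ratio and reads "guessing in proportion" as drawing each label independently with the class frequencies, so that the expected accuracy is \(\pi ^{2}+(1-\pi )^{2}\) for a minority share \(\pi\). Both the reading and the function name are ours.

```python
def proportional_guess_accuracy(minority_share):
    """Expected accuracy when labels are guessed at random with the class frequencies."""
    majority_share = 1.0 - minority_share
    return minority_share**2 + majority_share**2

# A one-to-nine class ratio corresponds to a minority share of 1 / 10.
baseline = proportional_guess_accuracy(0.1)
pegasos_accuracies = [0.52, 0.49, 0.48]   # Table 6, p = 10, 20, 50

print(f"proportional-guess baseline: {baseline:.2f}")
print(all(acc < baseline for acc in pegasos_accuracies))
```

Under this reading, the baseline is roughly 0.82, well above the PEGASOS accuracies of about 0.5 reported for all three values of *p*.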

## 7 Conclusions

In modern data analysis, there is a need for the development of new algorithms that are operable within the Big Data context. One prevalent notion of Big Data is that it is defined via the three Vs: variety, velocity, and volume. Whereas variety must be considered when choosing a model for any particular task, the fitting of the model and conducting of said task are required to be amenable to high velocity data of large volume (e.g. streamed data that cannot be stored in memory simultaneously).

The SMM framework, as proposed by Razaviyayn et al. (2016), provides a useful paradigm for constructing algorithms that can cater to the analysis of large volumes of streamed data. Using the SMM framework, we have constructed three algorithms, SMMH, SMMS, and SMML, that solve the approximate hinge loss, approximate squared-hinge loss, and logistic loss SVM problems in the streamed data setting, respectively.

Using the theoretical results of Razaviyayn et al. (2016), we have validated that the three constructed algorithms are convergent under mild regularity conditions and that they converge to globally minimal solutions. Two simulation studies demonstrate that the SVMs obtained via the constructed SMM algorithms are comparable in accuracy to state-of-the-art batch algorithms from the *LIBLINEAR* package, and also outperform the leading stream algorithm PEGASOS. With respect to timing, the SMM algorithms are found to be fast but not as fast as the *LIBLINEAR* or PEGASOS algorithms. It is difficult to make comparisons between the SMM algorithms and the *LIBLINEAR* algorithms based on speed, as they are performing two similar but overall different tasks. However, the difference compared to PEGASOS implies that there is a tradeoff between speed and accuracy when choosing between the SMM algorithms and PEGASOS.

Finally, we conducted an example analysis of the MNIST data set of LeCun (1998). Here, we again find that the SMM algorithms are comparable in accuracy to the *LIBLINEAR* algorithms and surpass the accuracy of PEGASOS. In this real data setting, the SMM algorithms were also found to be somewhat more comparable in computational timing performance to the *LIBLINEAR* algorithms. PEGASOS is again faster than the SMM algorithms, although there is once more a tradeoff between accuracy and speed.

It can be seen that the SMM algorithms require more computation time than the LIBSP and LIBLP algorithms from *LIBLINEAR*, particularly when *p* is large. This is due to the careful and highly optimized use of trust region or line-search Newton-like or conjugate gradient methods within the two aforementioned algorithms (cf. Lin et al. 2008; Hsia et al. 2017). Unfortunately, the techniques that are implemented are specific to the use of the Newton-like or conjugate gradient methods for solving the squared-hinge and logistic loss SVM problems. Thus, the discussed techniques cannot be ported to work with our SMM algorithms in a trivial way.

Recently, works by Chouzenoux et al. (2011), Chouzenoux et al. (2013), and Chouzenoux and Pesquet (2017) have shown that the trust region, line search, and Newton-like methods can be used within the MM and SMM framework to obtain computationally efficient and competitive solutions to various optimization problems from image and signal processing. Furthermore, generic coordinate-descent methodology for MM algorithms can also be utilized to obtain speedup and better scaling with *p*. Such methods include the use of the majorizer of De Pierro (1993) or the blockwise frameworks of Razaviyayn et al. (2013) and Mairal (2015).

These directions were not pursued within this article due to the numerous additional theoretical results that would be required to demonstrate the global convergence of such constructions. However, a result for a specific SMM solver of a least-squares regression problem was obtained by Chouzenoux and Pesquet (2017). This indicates that similar results for our SVM problems are also available. We defer the construction of these extended algorithms and the establishment of their global convergence for future work.

## Notes

### Acknowledgements

We thank the Associate Editor and Reviewer of the article for making helpful comments that greatly improved our exposition. HDN was supported by Australian Research Council (ARC) Grant DE170101134. GJM was supported by ARC Grant DP170100907.

## References

- Abe, S. (2010). *Support Vector Machines for Pattern Classification*. London: Springer.
- Bohning, D., & Lindsay, B. R. (1988). Monotonicity of quadratic-approximation algorithms. *Annals of the Institute of Statistical Mathematics*, *40*, 641–663.
- Boyd, S., & Vandenberghe, L. (2004). *Convex Optimization*. Cambridge: Cambridge University Press.
- Cappe, O., & Moulines, E. (2009). On-line expectation-maximization algorithm for latent data models. *Journal of the Royal Statistical Society B*, *71*, 593–613.
- Chouzenoux, E., Idier, J., & Moussaoui, S. (2011). A majorize-minimize strategy for subspace optimization applied to image restoration. *IEEE Transactions on Image Processing*, *20*, 1517–1528.
- Chouzenoux, E., Jezierska, A., Pesquet, J.-C., & Talbot, H. (2013). A majorize-minimize subspace approach for \(l_2\)-\(l_0\) image regularization. *SIAM Journal on Imaging Sciences*, *6*, 563–591.
- Chouzenoux, E., & Pesquet, J.-C. (2017). A stochastic majorize-minimize subspace algorithm for online penalized least squares estimation. *IEEE Transactions on Signal Processing*, *65*, 4770–4783.
- Cortes, C., & Vapnik, V. (1995). Support-vector networks. *Machine Learning*, *20*, 273–297.
- De Pierro, A. R. (1993). On the relation between the ISRA and the EM algorithm for positron emission tomography. *IEEE Transactions on Medical Imaging*, *12*, 328–333.
- Eddelbuettel, D. (2013). *Seamless R and C++ Integration with Rcpp*. New York: Springer.
- Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. (2008). LIBLINEAR: a library for large linear classification. *Journal of Machine Learning Research*, *9*, 1871–1874.
- Groenen, P. J. F., Nalbantov, G., & Bioch, J. C. (2008). SVM-Maj: a majorization approach to linear support vector machines with different hinge errors. *Advances in Data Analysis and Classification*, *2*, 17–43.
- Helleputte, T. (2017). *LiblineaR: Linear Predictive Models Based on the LIBLINEAR C/C++ Library*.
- Hsia, C.-Y., Zhu, Y., & Lin, C.-J. (2017). A study on trust region update rules in Newton methods for large-scale linear classification. *Proceedings of Machine Learning Research*, *77*, 33–48.
- Jolliffe, I. T. (2002). *Principal Component Analysis*. New York: Springer.
- Kim, S., Pasupathy, R., & Henderson, S. G. (2015). A guide to sample average approximation. In *Handbook of Simulation Optimization* (pp. 207–243). New York: Springer.
- Lange, K. (2016). *MM Optimization Algorithms*. Philadelphia: SIAM.
- LeCun, Y. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
- Lin, C.-J., Weng, R. C., & Keerthi, S. S. (2008). Trust region Newton method for large-scale logistic regression. *Journal of Machine Learning Research*, *9*, 627–650.
- Mairal, J. (2013). Stochastic majorization-minimization algorithms for large-scale optimization. In *Advances in Neural Information Processing Systems* (pp. 2283–2291).
- Mairal, J. (2015). Incremental majorization-minimization optimization with application to large-scale machine learning. *SIAM Journal on Optimization*, *25*, 829–855.
- McAfee, A., Brynjolfsson, E., & Davenport, T. H. (2012). Big data: the management revolution. *Harvard Business Review*, *90*, 60–68.
- Navia-Vasquez, A., Perez-Cruz, F., Artes-Rodriguez, A., & Figueiras-Vidal, A. R. (2001). Weighted least squares training of support vector classifiers leading to compact and adaptive schemes. *IEEE Transactions on Neural Networks*, *12*, 1047–1059.
- Nguyen, H. D., & McLachlan, G. J. (2017). Iteratively-reweighted least-squares fitting of support vector machines: a majorization-minimization algorithm approach. In *Proceedings of the 2017 Future Technologies Conference (FTC)*.
- Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. *Econometric Theory*, *7*, 186–199.
- R Core Team (2016). *R: a language and environment for statistical computing*. R Foundation for Statistical Computing.
- Razaviyayn, M., Hong, M., & Luo, Z.-Q. (2013). A unified convergence analysis of block successive minimization methods for nonsmooth optimization. *SIAM Journal on Optimization*, *23*, 1126–1153.
- Razaviyayn, M., Sanjabi, M., & Luo, Z.-Q. (2016). A stochastic successive minimization method for nonsmooth nonconvex optimization with applications to transceiver design in wireless communication networks. *Mathematical Programming Series B*, *157*, 515–545.
- Scholkopf, B., & Smola, A. J. (2002). *Learning with Kernels*. Cambridge: MIT Press.
- Shalev-Shwartz, S., Singer, Y., Srebro, N., & Cotter, A. (2011). Pegasos: primal estimated sub-gradient solver for SVM. *Mathematical Programming Series B*, *127*, 3–30.
- Shawe-Taylor, J., & Sun, S. (2011). A review of optimization methodologies in support vector machines. *Neurocomputing*, *74*, 3609–3618.
- Steinwart, I., & Christmann, A. (2008). *Support Vector Machines*. New York: Springer.
- Titterington, D. M. (1984). Recursive parameter estimation using incomplete data. *Journal of the Royal Statistical Society B*, *46*, 257–267.
- Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In *Proceedings of the twenty-first international conference on Machine learning*.