An efficient primal dual prox method for nonsmooth optimization
Abstract
We study nonsmooth optimization problems in machine learning, where both the loss function and the regularizer are nonsmooth. Previous studies on efficient empirical loss minimization assume either a smooth loss function or a strongly convex regularizer, making them unsuitable for nonsmooth optimization. We develop a simple yet efficient method for a family of nonsmooth optimization problems where the dual form of the loss function is bilinear in the primal and dual variables. We cast a nonsmooth optimization problem into a minimax optimization problem, and develop a primal dual prox method that solves the minimax problem at a rate of \(O(1/T)\), assuming that the proximal step can be efficiently solved, significantly faster than a standard subgradient descent method, which has an \(O(1/\sqrt{T})\) convergence rate. Our empirical studies verify the efficiency of the proposed method for various nonsmooth optimization problems that arise ubiquitously in machine learning by comparing it to the state-of-the-art first order methods.
Keywords
Nonsmooth optimization · Primal dual method · Convergence rate · Sparsity · Efficiency

1 Introduction
Formulating machine learning tasks as a regularized empirical loss minimization problem makes an intimate connection between machine learning and mathematical optimization. In regularized empirical loss minimization, one tries to jointly minimize an empirical loss over training samples plus a regularization term of the model. This formulation includes support vector machine (SVM) (Hastie et al. 2008), support vector regression (Smola and Schölkopf 2004), Lasso (Zhu et al. 2003), logistic regression, and ridge regression (Hastie et al. 2008), among many others. Therefore, optimization methods play a central role in solving machine learning problems, and the challenges arising in machine learning applications demand the development of new optimization algorithms.
Depending on the application at hand, various types of loss and regularization functions have been introduced in the literature. The efficiency of different optimization algorithms crucially depends on the specific structures of the loss and the regularization functions. Recently, there has been significant interest in gradient-descent-based methods due to their simplicity and scalability to large datasets. A well-known example is the Pegasos algorithm (Shalev-Shwartz et al. 2011), which minimizes the \(\ell _2^2\) regularized hinge loss (i.e., SVM) and achieves a convergence rate of \(O(1/T)\), where \(T\) is the number of iterations, by exploiting the strong convexity of the regularizer. Several other first order algorithms (Ji and Ye 2009; Chen et al. 2009) have also been proposed for smooth loss functions (e.g., squared loss and logistic loss) and nonsmooth regularizers (e.g., \(\ell _{1,\infty }\) and group lasso). They achieve a convergence rate of \(O(1/T^2)\) by exploiting the smoothness of the loss functions.
In this paper, we focus on a more challenging case where both the loss function and the regularizer are nonsmooth, to which we refer as nonsmooth optimization. Nonsmooth optimization of regularized empirical loss has found applications in many machine learning problems. Examples of nonsmooth loss functions include hinge loss (Vapnik 1998), generalized hinge loss (Bartlett and Wegkamp 2008), absolute loss (Hastie et al. 2008), and \(\epsilon \)-insensitive loss (Rosasco et al. 2004); examples of nonsmooth regularizers include lasso (Zhu et al. 2003), group lasso (Yuan and Lin 2006), sparse group lasso (Yang et al. 2010), exclusive lasso (Zhou et al. 2010), the \(\ell _{1,\infty }\) regularizer (Quattoni et al. 2009), and the trace norm regularizer (Rennie and Srebro 2005).
Although there are many existing studies tackling smooth loss functions (e.g., square loss for regression, logistic loss for classification) or smooth regularizers (e.g., the \(\ell _2^2\) norm), there are serious challenges in developing efficient algorithms for nonsmooth optimization. In particular, common tricks, such as smoothing nonsmooth objective functions (Nesterov 2005a, b), cannot be applied to nonsmooth optimization to improve the convergence rate. This is because they require both the loss function and the regularizer to be written in the maximization form of bilinear functions, a requirement that is often violated, as we will discuss later. In this work, we focus on optimization problems in machine learning where both the loss function and the regularizer are nonsmooth. Our goal is to develop an efficient gradient based algorithm that has a convergence rate of \(O(1/{T})\) for a wide family of nonsmooth loss functions and general nonsmooth regularizers.
Note that, according to information-based complexity theory (Traub et al. 1988), it is impossible to derive an efficient first order algorithm that works for all nonsmooth objective functions in general. As a result, we focus on a family of nonsmooth optimization problems where the dual form of the nonsmooth loss function is bilinear in both primal and dual variables. Additionally, we show that many nonsmooth loss functions have this bilinear dual form. We derive an efficient gradient based method, with a convergence rate of \(O(1/T)\), that explicitly updates both the primal and dual variables. The proposed method is referred to as the Primal Dual Prox (Pdprox) method. Besides its capability of dealing with nonsmooth optimization, the proposed method is effective in handling learning problems where additional constraints are introduced for the dual variables.
The rest of this paper is organized as follows. Section 2 reviews the related work on minimizing regularized empirical loss especially the first order methods for largescale optimization. Section 3 presents some notations and definitions. Section 4 presents the proposed primal dual prox method, its convergence analysis, and several extensions of the proposed method. Section 5 presents the empirical studies, and Sect. 6 concludes this work.
2 Related work
Our work is closely related to the previous studies on regularized empirical loss minimization. In the following discussion, we mostly focus on nonsmooth loss functions and nonsmooth regularizers.
2.1 Nonsmooth loss functions
Hinge loss is probably the most commonly used nonsmooth loss function for classification. It is closely related to the max-margin criterion. A number of algorithms have been proposed to minimize the \(\ell _2^2\) regularized hinge loss (Platt 1998; Joachims 1999, 2006; Hsieh et al. 2008; Shalev-Shwartz et al. 2011), and the \(\ell _1\) regularized hinge loss (Cai et al. 2010; Zhu et al. 2003; Fung and Mangasarian 2002). Besides the hinge loss, a generalized hinge loss function (Bartlett and Wegkamp 2008) has recently been proposed for cost-sensitive learning. For regression, square loss is commonly used due to its smoothness. However, nonsmooth loss functions such as absolute loss (Hastie et al. 2008) and \(\epsilon \)-insensitive loss (Rosasco et al. 2004) are useful for robust regression. The Bayes optimal predictor of square loss is the mean of the predictive distribution, while the Bayes optimal predictor of absolute loss is the median of the predictive distribution. Therefore absolute loss is more robust for long-tailed error distributions and outliers (Hastie et al. 2008). Rosasco et al. (2004) also proved that the estimation error bound for absolute loss and \(\epsilon \)-insensitive loss converges faster than that of square loss. The nonsmooth piecewise linear loss has been used in quantile regression (Koenker 2005; Gneiting 2008). Unlike the absolute loss, the piecewise linear loss can model asymmetric errors.
2.2 Nonsmooth regularizers
Besides simple nonsmooth regularizers such as the \(\ell _1,\,\ell _2\), and \(\ell _\infty \) norms (Duchi and Singer 2009), many other nonsmooth regularizers have been employed in machine learning tasks. Yuan and Lin (2006) introduced the group lasso for selecting important explanatory factors in a grouped manner. The \(\ell _{1,\infty }\) norm regularizer has been used for multi-task learning (Argyriou et al. 2008). In addition, several recent works (Hou et al. 2011; Nie et al. 2010; Liu et al. 2009) considered the mixed \(\ell _{2,1}\) regularizer for feature selection. Zhou et al. (2010) introduced the exclusive lasso for multi-task feature selection to model the scenario where variables within a single group compete with each other. The trace norm regularizer is another nonsmooth regularizer, which has found applications in matrix completion (Recht et al. 2010; Candès and Recht 2008), matrix factorization (Rennie and Srebro 2005; Srebro et al. 2005), and multi-task learning (Argyriou et al. 2008; Ji and Ye 2009). The optimization algorithms presented in these works are usually limited: either the convergence rate is not guaranteed (Argyriou et al. 2008; Recht et al. 2010; Hou et al. 2011; Nie et al. 2010; Rennie and Srebro 2005; Srebro et al. 2005) or the loss functions are assumed to be smooth (e.g., the square loss or the logistic loss) (Liu et al. 2009; Ji and Ye 2009). Despite the significant efforts in developing algorithms for minimizing regularized empirical losses, it remains a challenge to design a first order algorithm that is able to efficiently solve nonsmooth optimization problems at a rate of \(O(1/T)\) when both the loss function and the regularizer are nonsmooth.
2.3 Gradient based optimization
Our work is closely related to (sub)gradient based optimization methods. The convergence rate of gradient based methods usually depends on the properties of the objective function to be optimized. When the objective function is strongly convex and smooth, it is well known that gradient descent methods can achieve a geometric convergence rate (Boyd and Vandenberghe 2004). When the objective function is smooth but not strongly convex, the optimal convergence rate of a gradient descent method is \(O(1/T^2)\), achieved by Nesterov’s methods (Nesterov 2007). For objective functions that are strongly convex but not smooth, the convergence rate becomes \(O(1/T)\) (Shalev-Shwartz et al. 2011). For general nonsmooth objective functions, the optimal rate of any first order method is \(O(1/\sqrt{T})\). Although this is not improvable in general, recent studies are able to improve the rate to \(O(1/T)\) by exploiting the special structure of the objective function (Nesterov 2005a, b). In addition, several methods have been developed for composite optimization, where the objective function is written as a sum of a smooth and a nonsmooth function (Lan 2010; Nesterov 2007; Lin 2010). Recently, these optimization techniques have been successfully applied to various machine learning problems, such as SVM (Zhou et al. 2010), general regularized empirical loss minimization (Duchi and Singer 2009; Hu et al. 2009), trace norm minimization (Ji and Ye 2009), and multi-task sparse learning (Chen et al. 2009). Despite these efforts, one major limitation of the existing (sub)gradient based algorithms is that, in order to achieve a convergence rate better than \(O(1/\sqrt{T})\), they have to assume that the loss function is smooth or the regularizer is strongly convex, making them unsuitable for nonsmooth optimization.
2.4 Convex–concave optimization
The present work is also related to convex–concave minimization. Tseng (2008) and Nemirovski (2005) developed prox methods that have a convergence rate of \(O(1/T)\), provided the gradients are Lipschitz continuous; these methods have been applied to machine learning problems (Sun et al. 2009). In contrast, our method achieves a rate of \(O(1/T)\) without requiring the whole gradient to be Lipschitz continuous, only part of it. Several other primal–dual algorithms have been developed for regularized empirical loss minimization that update both primal and dual variables. Zhu and Chan (2008) proposed a primal–dual method based on gradient descent, which only achieves a rate of \(O(1/\sqrt{T})\). It was generalized in Esser et al. (2010), which shares a similar spirit with the proposed algorithm. However, an explicit convergence rate was not established even though convergence was proved. Mosci et al. (2010) presented a primal–dual algorithm for group sparse regularization, which updates the primal variable by a prox method and the dual variable by Newton’s method. In contrast, the proposed algorithm is a first order method that does not require computing the Hessian matrix as Newton’s method does, and is therefore more scalable to large datasets. Combettes and Pesquet (2011) and Boţ and Csetnek (2012) proposed primal–dual splitting algorithms for finding zeros of maximal monotone operators of special types. Lan et al. (2011) considered primal–dual convex formulations for general cone programming and applied Nesterov’s optimal first order method (Nesterov 2007), Nesterov’s smoothing technique (Nesterov 2005b), and Nemirovski’s prox method (Nemirovski 2005). Nesterov (2005a) proposed a primal dual gradient method for a special class of structured nonsmooth optimization problems by exploiting an excessive gap technique.
2.5 Optimizing nonsmooth functions
We note that Nesterov’s smoothing technique (Nesterov 2005b) and excessive gap technique (Nesterov 2005a) can be applied to nonsmooth optimization and both achieve an \(O(1/T)\) convergence rate for a special class of nonsmooth optimization problems. However, the limitation of these approaches is that they require all the nonsmooth terms (i.e., the loss and the regularizer) to be written in an explicit max structure that consists of a bilinear function in primal and dual variables, which limits their application to many machine learning problems. In addition, Nesterov’s algorithms need to solve additional maximization problems at each iteration. In contrast, the proposed algorithm only requires a mild condition on the nonsmooth loss function (Sect. 4), and allows for any commonly used nonsmooth regularizer, without having to solve an additional optimization problem at each iteration. Compared to Nesterov’s algorithms, the proposed algorithm is applicable to a larger class of nonsmooth optimization problems, is easier to implement, has a much simpler convergence analysis, and its empirical performance is usually favorable. Finally, we note that, as we were preparing our manuscript, a related work (Chambolle and Pock 2011) was published in the Journal of Mathematical Imaging and Vision that shares a similar idea with this work. Both works maintain and update the primal and dual variables for solving a nonsmooth optimization problem, and achieve the same convergence rate (i.e., \(O(1/T)\)). However, our work is distinguished from Chambolle and Pock (2011) in the following aspects: (i) We propose and analyze two primal dual prox methods: one applies an extra gradient update to the dual variables and the other to the primal variables. Depending on the nature of the application, one method may be more efficient than the other; (ii) In Sect. 4.6, we discuss how to efficiently solve the interim projection problems for updating both the primal and the dual variables, a critical issue for making the proposed algorithm practically efficient. In contrast, Chambolle and Pock (2011) simply assume that the interim projection problems can be solved efficiently; (iii) We focus our analysis and empirical studies on optimization problems that are closely related to machine learning. We demonstrate the effectiveness of the proposed algorithm on various classification, regression, and matrix completion tasks with nonsmooth loss functions and nonsmooth regularizers; (iv) We also conduct analysis and experiments on the convergence of the proposed methods when dealing with an \(\ell _1\) constraint on the dual variable, an approach that is commonly used in robust optimization, and observe that the proposed methods converge much faster when the bound of the \(\ell _1\) constraint is small and that the obtained solution is more robust in terms of prediction in the presence of label noise. In contrast, the study (Chambolle and Pock 2011) only considers applications to imaging problems.
We also note that the proposed algorithm is closely related to the proximal point algorithm (Rockafellar 1976), as shown in He and Yuan (2012), and to many variants, including the modified Arrow–Hurwicz method (Popov 1980), the Douglas–Rachford (DR) splitting algorithm (Lions and Mercier 1979), the alternating direction method of multipliers (ADMM) (Boyd et al. 2011), the forward–backward splitting algorithm (Bredies 2009), and the FISTA algorithm (Beck and Teboulle 2009). For a detailed comparison with some of these algorithms, one can refer to Chambolle and Pock (2011).
3 Notations and definitions
In this section we provide the basic setup and some preliminary definitions and notations used throughout this paper.
We denote by \([n]\) the set of integers \(\{1,\cdots , n\}\). We denote by \(({\mathbf {x}}_i, y_i), i \in [n]\) the training examples, where \({\mathbf {x}}_i\in {\mathcal {X}}\subseteq {\mathbb {R}}^d\) and \(y_i\) is the assigned class label, which is discrete for classification and continuous for regression. We assume \(\Vert {\mathbf {x}}_i\Vert _2\le R, \; \forall i \in [n]\). We denote by \({\mathbf {X}}=({\mathbf {x}}_1,\cdots , {\mathbf {x}}_n)^{\top }\) and \({\mathbf {y}}=(y_1,\cdots , y_n)^{\top }\). Let \({\mathbf {w}}\in {\mathbb {R}}^d\) denote the linear hypothesis, and let \(\ell ({\mathbf {w}}; {\mathbf {x}}, y)\) denote the loss of the prediction made by the hypothesis \({\mathbf {w}}\) on example \(({\mathbf {x}}, y)\), which is a convex function in terms of \({\mathbf {w}}\). Examples of convex loss functions are the hinge loss \(\ell ({\mathbf {w}}; {\mathbf {x}}, y)= \max (1-y{\mathbf {w}}^{\top }{\mathbf {x}}, 0)\), and the absolute loss \(\ell ({\mathbf {w}}; {\mathbf {x}}, y) =|{\mathbf {w}}^{\top }{\mathbf {x}} - y|\). To characterize a function, we introduce the following definitions.
Definition 1
Definition 2
We denote by \(\varPi _{\mathcal {Q}}[\widehat{{\mathbf {z}}}]=\arg \min \limits _{{\mathbf {z}}\in {\mathcal {Q}}}\frac{1}{2}\Vert {\mathbf {z}}\widehat{{\mathbf {z}}}\Vert _2^2\) the projection of \(\widehat{{\mathbf {z}}}\) into domain \({\mathcal {Q}}\), and by \(\varPi _{{\mathcal {Q}}_1, {\mathcal {Q}}_2}\begin{pmatrix}\widehat{{\mathbf {z}}}_1\\ \widehat{{\mathbf {z}}}_2\end{pmatrix}\) the joint projection of \(\widehat{{\mathbf {z}}}_1\) and \(\widehat{{\mathbf {z}}}_2\) into domains \({\mathcal {Q}}_1\) and \({\mathcal {Q}}_2\), respectively. Finally, we use \([s]_{[0,a]}\) to denote the projection of \(s\) into \([0, a]\), where \(a>0\).
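The projection operator \(\varPi _{\mathcal {Q}}[\cdot ]\) defined above is the workhorse of the algorithms in Sect. 4. As an illustration (ours, not from the paper), here is a minimal NumPy sketch of two such projections that admit closed forms, onto a box and onto an \(\ell _2\) ball:

```python
import numpy as np

def project_box(z, lo, hi):
    """Euclidean projection of z onto the box [lo, hi]^n, i.e.
    argmin_{x in [lo,hi]^n} 0.5*||x - z||_2^2 (coordinate-wise clipping)."""
    return np.clip(z, lo, hi)

def project_l2_ball(z, r=1.0):
    """Euclidean projection of z onto the ball {x : ||x||_2 <= r}:
    rescale z when it lies outside the ball, keep it otherwise."""
    nrm = np.linalg.norm(z)
    return z if nrm <= r else (r / nrm) * z
```

The notation \([s]_{[0,a]}\) above is exactly `project_box(s, 0.0, a)` applied to a scalar.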
4 Pdprox: a primal dual prox method for nonsmooth optimization
We first describe the nonsmooth optimization problems that the proposed algorithm can be applied to, and then present the primal dual prox method for nonsmooth optimization. We then prove the convergence rate of the proposed algorithms and discuss several extensions. Proofs for technical lemmas are deferred to the appendix.
4.1 Nonsmooth optimization
Remark 1 One direct consequence of the assumption in (3) is that the partial gradient \(G_{{\mathbf {w}}}({\mathbf {w}},{\varvec{\alpha }})\) is independent of \({\mathbf {w}}\), and \(G_{\varvec{\alpha }}({\mathbf {w}},{\varvec{\alpha }})\) is independent of \({\varvec{\alpha }}\), since \(L({\mathbf {w}},{\varvec{\alpha }})\) is bilinear in \({\mathbf {w}}\) and \({\varvec{\alpha }}\). We will explicitly exploit this property in developing the efficient optimization algorithms. We also note that no explicit assumption is made on the regularizer \(R({\mathbf {w}})\). This is in contrast to the smoothing techniques used in Nesterov (2005a, b).
 Hinge loss (Vapnik 1998):$$\begin{aligned} \ell ({\mathbf {w}}; {\mathbf {x}}, y)&=\max (0, 1-y{\mathbf {w}}^{\top }{\mathbf {x}})=\max _{\alpha \in [0, 1]} \alpha (1-y{\mathbf {w}}^{\top }{\mathbf {x}}). \end{aligned}$$
 Generalized hinge loss (Bartlett and Wegkamp 2008), where \(a>1\):$$\begin{aligned} \ell ({\mathbf {w}};{\mathbf {x}}, y)&= \left\{ \begin{array}{ll} 1-ay{\mathbf {w}}^{\top }{\mathbf {x}}&{} \quad \text{ if }\;y{\mathbf {w}}^{\top }{\mathbf {x}}\le 0\\ 1-y{\mathbf {w}}^{\top }{\mathbf {x}}&{} \quad \text{ if }\; 0<y{\mathbf {w}}^{\top }{\mathbf {x}}< 1\\ 0&{} \quad \text{ if }\; y{\mathbf {w}}^{\top }{\mathbf {x}}\ge 1 \end{array} \right. \\&= {\mathop {\mathop {\max }\limits _{\alpha _1 \ge 0, \alpha _2 \ge 0}}\limits _{\alpha _1+\alpha _2\le 1}}\alpha _1(1-ay{\mathbf {w}}^{\top }{\mathbf {x}}) + \alpha _2(1-y{\mathbf {w}}^{\top }{\mathbf {x}}). \end{aligned}$$
 Absolute loss (Hastie et al. 2008):$$\begin{aligned} \ell ({\mathbf {w}}; {\mathbf {x}}, y)=|{\mathbf {w}}^{\top }{\mathbf {x}}-y|=\max _{\alpha \in [-1,1]}\alpha ({\mathbf {w}}^{\top }{\mathbf {x}} - y). \end{aligned}$$
 \(\epsilon \)-insensitive loss (Rosasco et al. 2004):$$\begin{aligned} \ell ({\mathbf {w}}; {\mathbf {x}}, y)=\max (|{\mathbf {w}}^{\top }{\mathbf {x}}-y|-\epsilon , 0)={\mathop {\mathop {\max }\limits _{\alpha _1\ge 0,\alpha _2\ge 0}}\limits _{\alpha _1+\alpha _2\le 1}}\left[ ({\mathbf {w}}^{\top }{\mathbf {x}} - y)(\alpha _1-\alpha _2) - \epsilon (\alpha _1+\alpha _2)\right] . \end{aligned}$$
 Piecewise linear loss (Koenker 2005):$$\begin{aligned} \ell ({\mathbf {w}};{\mathbf {x}}, y)&= \left\{ \begin{array}{ll} a|{\mathbf {w}}^{\top }{\mathbf {x}} - y|&{} \quad \text{ if }\,{\mathbf {w}}^{\top }{\mathbf {x}}\le y\\ (1-a)|{\mathbf {w}}^{\top }{\mathbf {x}}-y| &{} \quad \text{ if }\,{\mathbf {w}}^{\top }{\mathbf {x}}\ge y \end{array} \right. \\&= {\mathop {\mathop {\max }\limits _{\alpha _1 \ge 0, \alpha _2 \ge 0}}\limits _{\alpha _1+\alpha _2\le 1}}\alpha _1 a(y-{\mathbf {w}}^{\top }{\mathbf {x}}) + \alpha _2(1-a)({\mathbf {w}}^{\top }{\mathbf {x}}-y). \end{aligned}$$
 \(\ell _2\) loss (Nie et al. 2010):$$\begin{aligned} \ell ({\mathbf {W}}; {\mathbf {x}}, {\mathbf {y}}) = \Vert {\mathbf {W}}^{\top }{\mathbf {x}}-{\mathbf {y}}\Vert _2 = \max _{\Vert \alpha \Vert _2\le 1} \alpha ^{\top }({\mathbf {W}}^{\top }{\mathbf {x}}-{\mathbf {y}}), \end{aligned}$$where \({\mathbf {y}}\in {\mathbb {R}}^K\) is the multi-class label vector and \({\mathbf {W}}=({\mathbf {w}}_1,\cdots , {\mathbf {w}}_K)\).
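As a sanity check (ours, not part of the paper), the max-form duals above can be verified numerically: maximizing the bilinear function over a fine grid of feasible \(\alpha \) recovers the primal loss. A sketch for the hinge and absolute losses:

```python
import numpy as np

# Numeric check that the variational (max-form) duals recover the primal losses.
def hinge(m):           # m = y * w^T x (the margin)
    return max(0.0, 1.0 - m)

def hinge_dual(m, grid=np.linspace(0.0, 1.0, 1001)):
    # max over alpha in [0, 1] of alpha * (1 - m)
    return float(np.max(grid * (1.0 - m)))

def absolute(r):        # r = w^T x - y (the residual)
    return abs(r)

def absolute_dual(r, grid=np.linspace(-1.0, 1.0, 2001)):
    # max over alpha in [-1, 1] of alpha * r
    return float(np.max(grid * r))

for m in (-2.0, 0.0, 0.5, 3.0):
    assert abs(hinge(m) - hinge_dual(m)) < 1e-9
for r in (-1.5, 0.0, 2.0):
    assert abs(absolute(r) - absolute_dual(r)) < 1e-9
```

The maximum is always attained at an endpoint of the feasible interval, which is why the dual variables tend to be sparse (many \(\alpha _i = 0\)) at the optimum.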

lasso: \(R({\mathbf {w}})=\Vert {\mathbf {w}}\Vert _1,\,\ell _2\) norm: \(R({\mathbf {w}})=\Vert {\mathbf {w}}\Vert _2\), and \(\ell _{\infty }\) norm: \(R({\mathbf {w}})=\Vert {\mathbf {w}}\Vert _\infty \).

group lasso: \(R({\mathbf {w}})=\sum _{g=1}^K \sqrt{d_g}\Vert {\mathbf {w}}_g\Vert _2\), where \({\mathbf {w}}_g\in {\mathbb {R}}^{d_g}\).

exclusive lasso: \(R({\mathbf {W}})= \sum _{j=1}^d \Vert {\mathbf {w}}^j\Vert _1^{2}\).

\(\ell _{2,1}\) norm: \(R({\mathbf {W}})= \sum _{j=1}^d \Vert {\mathbf {w}}^j\Vert _{2}\).

\(\ell _{1,\infty }\) norm: \(R({\mathbf {W}})=\sum _{j=1}^d \Vert {\mathbf {w}}^{j}\Vert _\infty \).

trace norm: \(R({\mathbf {W}})=\Vert {\mathbf {W}}\Vert _1\), the summation of singular values of \({\mathbf {W}}\).

other regularizers: \(R({\mathbf {W}})=\left( \sum _{k=1}^K\Vert {\mathbf {w}}_k\Vert _2\right) ^2\).
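Several of the regularizers above admit closed-form proximal maps, which is what makes the composite gradient mapping used by the proposed algorithms practical. The following sketch (standard closed forms, not code from the paper; the group lasso version uses unit group weights rather than the \(\sqrt{d_g}\) weights above) computes \(\mathrm{prox}_{\lambda R}(\mathbf{z}) = \arg \min _{\mathbf{w}} \frac{1}{2}\Vert \mathbf{w}-\mathbf{z}\Vert _2^2 + \lambda R(\mathbf{w})\) for the lasso and group lasso penalties:

```python
import numpy as np

def prox_l1(z, lam):
    """Prox of lam*||w||_1 (lasso): coordinate-wise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def prox_group_l2(z, lam, groups):
    """Prox of lam*sum_g ||w_g||_2 (group lasso, unit group weights):
    block-wise soft-thresholding; groups is a list of index arrays."""
    w = np.zeros_like(z)
    for g in groups:
        nrm = np.linalg.norm(z[g])
        if nrm > lam:                      # otherwise the whole group is zeroed
            w[g] = (1.0 - lam / nrm) * z[g]
    return w
```

Both maps cost \(O(d)\), so the prox step adds essentially nothing to the per-iteration cost of a gradient method.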
We close this section by presenting a lemma showing an important property of the bilinear function \(L({\mathbf {w}}, {\varvec{\alpha }})\).
Lemma 1
Remark 2 The value of the constant \(c\) in Lemma 1 is an input to our algorithms, used to set the step size. In Appendix 1, we show how to estimate the constant \(c\) for certain loss functions. In addition, the constants \(c\) in the bounds (6) and (7) do not have to be the same, as shown by the example of the generalized hinge loss in Appendix 1. It should be noted that the inequalities in Lemma 1 indicate that \(L({\mathbf {w}},{\varvec{\alpha }})\) has Lipschitz continuous gradients; however, the gradient of the whole objective with respect to \({\mathbf {w}}\), i.e., \(G_{\mathbf {w}}({\mathbf {w}},{\varvec{\alpha }})+\lambda \partial R({\mathbf {w}})\), is not Lipschitz continuous due to the general nonsmooth term \(R({\mathbf {w}})\), which renders previous convex–concave minimization schemes (Tseng 2008; Nemirovski 2005) inapplicable.
4.2 The proposed primaldual prox methods
In this subsection, we present two variants of the Primal Dual Prox (Pdprox) method for solving the nonsmooth optimization problem in (2). The common feature shared by the two algorithms is that they update both the primal and the dual variables at each iteration. In contrast, most first order methods only update the primal variables. A key advantage of the proposed algorithms is that they are able to capture the sparsity structures of both primal and dual variables, which is usually the case when both the regularizer and the loss function are nonsmooth. The two algorithms differ from each other in the number of copies of the dual or the primal variables, and the specific order in which they are updated. Although our analysis shows that the two algorithms share the same convergence rate, our empirical studies show that one algorithm may be preferable to the other depending on the nature of the application.
4.3 Pdprox-dual algorithm
 (i)
it updates both the dual variable \(\varvec{\alpha }\) and the primal variable \({\mathbf {w}}\). This is useful when additional constraints are introduced for the dual variables, as we will discuss later.
 (ii)
it introduces an extra dual variable \(\varvec{\beta }\) in addition to \(\varvec{\alpha }\), and updates both \(\varvec{\alpha }\) and \(\varvec{\beta }\) at each iteration by a gradient mapping. The gradient mapping of the dual variables onto a sparse domain allows the proposed algorithm to capture the sparsity of the dual variables (more discussion on how the sparsity constraint on the dual variables affects the convergence is presented in Sect. 4.7). To distinguish it from the second algorithm presented below, we refer to Algorithm 1 as the Pdprox-dual algorithm since it introduces an extra dual variable in the updates.
 (iii)
the primal variable \({\mathbf {w}}\) is updated by a composite gradient mapping (Nesterov 2007) in step 5. Solving a composite gradient mapping in this step allows the proposed algorithm to capture the sparsity of the primal variable. Similar to many other approaches for composite optimization (Duchi and Singer 2009; Hu et al. 2009), we assume that the mapping in step 5 can be solved efficiently. (This is the only assumption we make on the nonsmooth regularizer. The discussion in Sect. 4.6 shows that the proposed algorithm can be applied to a large family of nonsmooth regularizers.)
 (iv)
the step size \(\gamma \) is fixed to \(\sqrt{1/(2c)}\), where \(c\) is the constant specified in Lemma 1. This is in contrast to most gradient based methods where the step size depends on \(T\) and/or \(\lambda \). This feature is particularly useful in implementation as we often observe that the performance of a gradient method is sensitive to the choice of the step size.
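Algorithm 1 itself is presented in the paper as a figure. To make the structure of the updates concrete, here is an illustrative reconstruction (ours; details such as the exact update order may differ from the authors' pseudocode) of a Pdprox-dual style iteration for the \(\ell _1\)-regularized hinge loss, using the fixed step size \(\gamma =\sqrt{1/(2c)}\) with \(c=R^2/n\) described above:

```python
import numpy as np

def pdprox_dual_sketch(X, y, lam, T=500):
    """Illustrative sketch (our reconstruction, not the paper's exact pseudocode) of
    a Pdprox-dual style iteration for min_w (1/n)*sum_i hinge_i(w) + lam*||w||_1,
    cast as min_w max_{alpha in [0,1]^n} (1/n)*sum_i alpha_i*(1 - y_i w^T x_i)
    + lam*||w||_1. Returns the averaged primal iterate."""
    n, d = X.shape
    c = np.max(np.sum(X ** 2, axis=1)) / n      # c = R^2/n as in the text
    gamma = np.sqrt(1.0 / (2.0 * c))            # fixed step size
    w = np.zeros(d); beta = np.zeros(n); w_avg = np.zeros(d)
    for _ in range(T):
        # dual gradient mapping: alpha from beta, using G_alpha at the old w
        alpha = np.clip(beta + gamma * (1.0 - y * (X @ w)) / n, 0.0, 1.0)
        # primal composite gradient mapping: prox of lam*||.||_1 (soft-threshold)
        z = w + gamma * (X.T @ (alpha * y)) / n   # w - gamma * G_w(., alpha)
        w = np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)
        # second dual gradient mapping: beta, using G_alpha at the new w
        beta = np.clip(beta + gamma * (1.0 - y * (X @ w)) / n, 0.0, 1.0)
        w_avg += w
    return w_avg / T

def obj(w, X, y, lam):
    """Primal objective: averaged hinge loss plus l1 penalty."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w))) + lam * np.sum(np.abs(w))
```

On a toy separable problem this iteration drives the objective well below its value at \(\mathbf{w}=0\) within a few hundred iterations, without any step-size tuning.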
4.4 Pdprox-primal algorithm
In Algorithm 1, we maintain two copies of the dual variable, \(\varvec{\alpha }\) and \(\varvec{\beta }\), and update them by two gradient mappings^{1}. We can actually save one gradient mapping on the dual variable by first updating the primal variable \({\mathbf {w}}_t\), and then updating \(\varvec{\alpha }_t\) using the partial gradient computed with \({\mathbf {w}}_t\). As a tradeoff, we add an extra primal variable \({\mathbf {u}}\), and update it by a simple calculation. The detailed steps are shown in Algorithm 2. Similar to Algorithm 1, Algorithm 2 also needs to compute two partial gradients (except for the initial partial gradient on the primal variable), i.e., \(G_{\mathbf {w}}(\cdot , \varvec{\alpha }_t)\) and \(G_{\varvec{\alpha }}({\mathbf {w}}_t, \cdot )\). Different from Algorithm 1, Algorithm 2 (i) maintains \(({\mathbf {w}}_t, \varvec{\alpha }_t, {\mathbf {u}}_t)\) at each iteration with \(O(2d+n)\) memory, while Algorithm 1 maintains \((\varvec{\alpha }_t, {\mathbf {w}}_t, \varvec{\beta }_t)\) at each iteration with \(O(2n+d)\) memory; and (ii) replaces one gradient mapping on an extra dual variable \(\varvec{\beta }_t\) with a simple update on an extra primal variable \({\mathbf {u}}_t\). Depending on the nature of the application, one method may be more efficient than the other. For example, if the dimension \(d\) is much larger than the number of examples \(n\), then Algorithm 1 would be preferable to Algorithm 2. When the number of examples \(n\) is much larger than the dimension \(d\), Algorithm 2 saves memory and computational cost. However, as shown by our analysis in Sect. 4.5, the convergence rates of the two algorithms are the same. Because it introduces an extra primal variable, we refer to Algorithm 2 as the Pdprox-primal algorithm.
Remark 1
It should be noted that although Algorithm 1 uses a similar strategy for updating the dual variables \(\varvec{\alpha }\) and \(\varvec{\beta }\), it is significantly different from the mirror prox method (Nemirovski 2005). First, unlike the mirror prox method, which introduces an auxiliary variable for \({\mathbf {w}}\), Algorithm 1 introduces a composite gradient mapping for updating \({\mathbf {w}}\). Second, Algorithm 1 updates \({\mathbf {w}}_t\) using the partial gradient computed from the updated dual variable \(\varvec{\alpha }_t\) rather than \(\varvec{\beta }_{t-1}\). Third, Algorithm 1 does not assume that the overall objective function has Lipschitz continuous gradients, a key assumption that limits the application of the mirror prox method.
Remark 2
A similar algorithm with an extra primal variable is also proposed in a recent work (Chambolle and Pock 2011). It is slightly different from Algorithm 2 in the order of updating on the primal variable and the dual variable, and the gradients used in the updating. We discuss the differences between the Pdprox method and the algorithm in Chambolle and Pock (2011) with our notations in Appendix 3.
4.5 Convergence analysis
Theorem 1
Remark 3
Before proceeding to the proof of Theorem 1, we present the following Corollary that follows immediately from Theorem 1 and states the convergence bound for the objective \({\mathcal {L}}({\mathbf {w}})\) in (2).
Corollary 1
Proof
In order to aid understanding, we present the proof of Theorem 1 for each algorithm separately in the following subsections.
4.5.1 Convergence analysis of Algorithm 1
Lemma 2
Proof
Finally, by plugging (13) for \({\mathbf {w}}_t\) into the update for \({\mathbf {u}}_t\) in (14), we complete the proof of Lemma 2. \(\square \)
The reason that we translate the updates for \((\varvec{\alpha }_t, {\mathbf {w}}_t, \varvec{\beta }_t)\) in Algorithm 1 into the updates for \((\varvec{\alpha }_t, {\mathbf {w}}_t, \varvec{\beta }_t, {\mathbf {u}}_t)\) in Lemma 2 is that it allows us to fit the updates for \((\varvec{\alpha }_t, {\mathbf {w}}_t, \varvec{\beta }_t, {\mathbf {u}}_t)\) into Lemma 8 as presented in Appendix 4, which leads to a key inequality, stated in Lemma 3, used to prove Theorem 1.
Lemma 3
The proof of Lemma 3 is deferred to Appendix 4. We are now ready to prove the main theorem for Algorithm 1.
Proof
4.5.2 Convergence analysis of Algorithm 2
We can prove the convergence bound for Algorithm 2 by following the same path. In the following we present the key lemmas similar to Lemmas 2 and 3, with proofs omitted.
Lemma 4
Lemma 5
Proof
Comparison with Pegasos on the \(\ell ^2_2\) regularizer. We compare the proposed algorithm to the Pegasos algorithm (Shalev-Shwartz et al. 2011)^{2} for minimizing the \(\ell _2^2\) regularized hinge loss. Although in this case both algorithms achieve a convergence rate of \(O(1/T)\), their dependence on the regularization parameter \(\lambda \) is very different. In particular, the convergence rate of the proposed algorithm is \(O\left( \frac{(1 + n\lambda )R}{\sqrt{2n}\lambda T}\right) \), by noting that \(\Vert {\mathbf {w}}^*\Vert _2^2 =O(1/\lambda ),\,\Vert \varvec{\alpha }^*\Vert ^2_2\le \Vert \varvec{\alpha }^*\Vert _1\le n\), and \(c=R^2/n\), while the Pegasos algorithm has a convergence rate of \(\widetilde{O}\left( \frac{(\sqrt{\lambda }+R)^2}{\lambda T}\right) \) (Corollary 1 in Shalev-Shwartz et al. 2011), where \(\widetilde{O}(\cdot )\) suppresses a logarithmic term \(\ln (T)\). According to a common assumption of learning theory (Wu and Zhou 2005; Smale and Zhou 2003), the optimal \(\lambda \) is \(O(n^{-1/(\tau + 1)})\) if the probability measure can be approximated by the closure of the RKHS \({\mathcal {H}}_{\kappa }\) with exponent \(0 < \tau \le 1\). As a result, the convergence rate of the proposed algorithm is \(O(\sqrt{n}R/T)\) while the convergence rate of Pegasos is \(O(n^{1/(1+\tau )}R^2/T)\). Since \(\tau \in (0, 1]\), the proposed algorithm could be more efficient than the Pegasos algorithm, particularly when \(\lambda \) is sufficiently small. This is verified by our empirical studies in Sect. 5.7 (see Fig. 8). It is also interesting to note that the convergence rate of Pdprox has a better dependence on \(R\), the \(\ell _2\) norm bound of the examples \(\Vert {\mathbf {x}}\Vert _2\le R\), compared to \(R^2\) in the convergence rate of Pegasos.
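To make this comparison concrete, one can substitute the optimal \(\lambda \) into the two rates (a one-line check we add here, using \(n\lambda = n^{\tau /(\tau +1)} \ge 1\) for large \(n\)):

```latex
% With \lambda = O(n^{-1/(\tau+1)}), we have n\lambda = n^{\tau/(\tau+1)} \ge 1, so
O\!\left(\frac{(1+n\lambda)R}{\sqrt{2n}\,\lambda T}\right)
  = O\!\left(\frac{n\lambda\,R}{\sqrt{2n}\,\lambda T}\right)
  = O\!\left(\frac{\sqrt{n}\,R}{T}\right),
\qquad
\widetilde{O}\!\left(\frac{(\sqrt{\lambda}+R)^2}{\lambda T}\right)
  = \widetilde{O}\!\left(\frac{R^2}{\lambda T}\right)
  = \widetilde{O}\!\left(\frac{n^{1/(\tau+1)}R^2}{T}\right).
```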
Finally, we note that the proposed algorithm is deterministic and requires a full pass over all training examples at each iteration, while Pegasos can be purely stochastic, sampling a single example to compute the subgradient while maintaining the same convergence rate. Extending the Pdprox algorithm to a stochastic or randomized version with a similar convergence rate remains an interesting open problem.
4.6 Implementation issues
In this subsection, we discuss some implementation issues: (1) how to efficiently solve the optimization problems for updating the primal and dual variables in Algorithms 1 and 2; (2) how to set a good step size; and (3) how to implement the algorithms efficiently.
Both \(\varvec{\alpha }\) and \(\varvec{\beta }\) are updated by a gradient mapping that requires computing the projection onto the domain \({\mathcal {Q}}_{\varvec{\alpha }}\). When \({\mathcal {Q}}_{\varvec{\alpha }}\) consists only of box constraints (e.g., hinge loss, absolute loss, and \(\epsilon \)-insensitive loss), the projection \(\prod _{{\mathcal {Q}}_\alpha }[\widehat{\alpha }]\) can be computed by thresholding. When \({\mathcal {Q}}_{\varvec{\alpha }}\) comprises both box constraints and a linear constraint (e.g., generalized hinge loss), the following lemma gives an efficient algorithm for computing \(\prod _{{\mathcal {Q}}_{\varvec{\alpha }}}[\widehat{\varvec{\alpha }}]\).
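As a concrete illustration, the thresholding projection for a box domain is a single elementwise clip; the bounds `lo` and `hi` below are placeholders for the actual domain of the loss (e.g., entries in \([0, 1/n]\) for the hinge loss), not values fixed by the analysis:

```python
import numpy as np

def project_box(alpha_hat, lo=0.0, hi=1.0):
    """Project onto the box {lo <= alpha_i <= hi} by elementwise thresholding."""
    return np.clip(alpha_hat, lo, hi)
```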
Lemma 6
Remark 4
It is notable that when the domain is of simplex type, i.e., \(\sum _i\alpha _i\le \rho \), Duchi et al. (2008) have proposed more efficient algorithms for solving the projection problem.
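A sketch of the sort-based projection of Duchi et al. (2008) onto the simplex-type domain \(\{\varvec{\alpha } : \alpha _i \ge 0, \sum _i \alpha _i \le \rho \}\); the function name and interface are ours:

```python
import numpy as np

def project_simplex_ineq(v, rho=1.0):
    """Euclidean projection onto {a : a_i >= 0, sum_i a_i <= rho},
    via the sort-based method of Duchi et al. (2008), O(n log n)."""
    w = np.maximum(v, 0.0)
    if w.sum() <= rho:            # already feasible after clipping negatives
        return w
    u = np.sort(v)[::-1]          # sort in decreasing order
    css = np.cumsum(u)
    j = np.arange(1, len(v) + 1)
    k = np.nonzero(u - (css - rho) / j > 0)[0][-1]   # last index with positive gap
    theta = (css[k] - rho) / (k + 1.0)               # shift that hits the budget
    return np.maximum(v - theta, 0.0)
```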
Remark 5
In Appendix 3, we show that the updates on \(({\mathbf {w}}_t, \varvec{\alpha }_t)\) of Algorithm 3 are essentially the same as those of Algorithm 1 in Chambolle and Pock (2011), if we remove the extra dual variable in Algorithm 3 and the extra primal variable in Algorithm 1 of Chambolle and Pock (2011). The difference is that Algorithm 3 maintains two dual variables and one primal variable at each iteration, while Algorithm 1 in Chambolle and Pock (2011) maintains two primal variables and one dual variable.
For the composite gradient mapping for \({\mathbf {w}}\in {\mathcal {Q}}_{\mathbf {w}}={\mathbb {R}}^d\), there is a closed form solution for simple regularizers (e.g., \(\ell _1,\ell _2\)) and decomposable regularizers (e.g., \(\ell _{1,2}\)). Efficient algorithms are available for the composite gradient mapping when the regularizer is the \(\ell _{\infty }\) norm, the \(\ell _{1,\infty }\) norm, or the trace norm. More details can be found in Duchi and Singer (2009) and Ji and Ye (2009). Here we present an efficient solution for a general regularizer \(V(\Vert {\mathbf {w}}\Vert )\), where \(\Vert {\mathbf {w}}\Vert \) is either a simple regularizer (e.g., \(\ell _1,\,\ell _2\), and \(\ell _{\infty }\)) or a decomposable regularizer (e.g., \(\ell _{1,2}\) and \(\ell _{1, \infty }\)), and \(V(z)\) is convex and monotonically increasing for \(z \ge 0\). An example is \(V(\Vert {\mathbf {w}}\Vert )=(\sum _{k}\Vert {\mathbf {w}}_k\Vert _2)^2\), where \({\mathbf {w}}_1, \ldots , {\mathbf {w}}_K\) form a partition of \({\mathbf {w}}\).
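For instance, for the \(\ell _1\) regularizer the composite gradient mapping reduces to the well-known soft-thresholding operator; a minimal sketch:

```python
import numpy as np

def prox_l1(w_hat, lam):
    """Closed-form composite gradient mapping for the l1 regularizer:
    argmin_w 0.5*||w - w_hat||_2^2 + lam*||w||_1  (soft-thresholding)."""
    return np.sign(w_hat) * np.maximum(np.abs(w_hat) - lam, 0.0)
```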
Lemma 7
The value of the step size \(\gamma \) in Algorithms 2 and 3 depends on the value of \(c\), a constant that upper bounds the squared spectral norm of the matrix \(H({\mathbf {X}}, {\mathbf {y}})\). In many machine learning applications, by assuming a bound on the data (e.g., \(\Vert {\mathbf {x}}\Vert _2\le R\)), one can easily compute an estimate of \(c\). We present derivations of the constant \(c\) for the hinge loss and the generalized hinge loss in Appendix 1. However, the computed value of \(c\) might be overestimated, and the step size \(\gamma \) thus underestimated. Therefore, to improve empirical performance, one can scale up the estimated value of \(\gamma \) by a factor larger than one and choose the best factor by tuning among a set of values. In addition, the authors of Chambolle and Pock (2011) suggested a scheme with two step sizes, \(\tau \) for updating the primal variable and \(\sigma \) for updating the dual variable. Depending on the nature of the application, one may observe better performance by carefully choosing the ratio between the two step sizes, provided that \(\sigma \) and \(\tau \) satisfy \(\sigma \tau \le 1/c\). In the last subsection, we observe improved performance for solving SVM by using the two-step-size scheme and carefully tuning the ratio between the two step sizes. Furthermore, Pock and Chambolle (2011) present a technique for computing diagonal preconditioners when estimating the value of \(c\) is difficult for complex problems, and apply it to general linear programming problems and some computer vision problems.
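A minimal sketch of this step-size recipe for the hinge loss, where \(c = R^2/n\) follows from the bound \(\Vert {\mathbf {x}}\Vert _2\le R\); the `scale` and `ratio` knobs mirror the tuning discussed above and are conventions of this sketch, not constants from the analysis:

```python
import numpy as np

def step_sizes(X, scale=1.0, ratio=1.0):
    """Estimate c = R^2/n for the hinge loss (R bounds ||x_i||_2), then form:
    - one-step-size scheme: gamma = scale * sqrt(1/(2c))
    - two-step-size scheme: sigma, tau with sigma/tau = ratio and sigma*tau = 1/c."""
    n = X.shape[0]
    R = np.sqrt((X ** 2).sum(axis=1)).max()   # largest example norm
    c = R ** 2 / n
    gamma = scale * np.sqrt(1.0 / (2.0 * c))
    tau = np.sqrt(1.0 / (c * ratio))          # then sigma = ratio*tau gives sigma*tau = 1/c
    sigma = ratio * tau
    return gamma, sigma, tau
```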
4.7 Extensions and discussion
4.7.1 Nonlinear model
4.7.2 Incorporating the bias term
It is easy to learn a bias term \(w_0\) in the classifier \({\mathbf {w}}^{\top }{\mathbf {x}}+ w_0\) with Pdprox. We use the augmented feature vector \(\widehat{{\mathbf {x}}}_i = \left( \begin{array}{c}1 \\ {\mathbf {x}}_i\end{array}\right) \) and the augmented weight vector \(\widehat{{\mathbf {w}}}=\left( \begin{array}{c}w_0\\ {\mathbf {w}}\end{array}\right) \), and run Algorithm 1 or 2 unchanged, except that (i) the regularizer \(R(\widehat{{\mathbf {w}}}) = R({\mathbf {w}})\) does not involve \(w_0\), and (ii) the step size \(\gamma =\sqrt{1/(2c)}\) takes a different value because the new feature vectors are bounded by \(\Vert \widehat{\mathbf {x}}\Vert _2\le \sqrt{1+R^2}\), which changes the value of \(c\) in Lemma 1 (cf. Appendix 1).
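The augmentation itself is a one-line change to the data matrix; a sketch:

```python
import numpy as np

def augment(X):
    """Prepend a constant-1 feature to every example so that w_hat = (w0, w)
    absorbs the bias; if ||x||_2 <= R then ||x_hat||_2 <= sqrt(1 + R^2)."""
    n = X.shape[0]
    return np.hstack([np.ones((n, 1)), X])
```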
4.7.3 Domain constraint on primal variable
4.7.4 Additional constraints on dual variables
5 Experiments

In Sects. 5.1, 5.2, and 5.3 we compare the proposed algorithm to state-of-the-art first order methods that directly update the primal variable at each iteration. We apply all the algorithms to three different tasks with different nonsmooth loss functions and regularizers. The baseline first order methods used in this study include the gradient descent algorithm (gd), the forward and backward splitting algorithm (fobos) (Duchi and Singer 2009), the regularized dual averaging algorithm (rda) (Xiao 2009), and the accelerated gradient descent algorithm (agd) (Chen et al. 2009). Since the proposed method is non-stochastic, we compare it to the non-stochastic variants of gd, fobos, and rda. Note that gd, fobos, rda, and agd share the same convergence rate of \(O(1/\sqrt{T})\) for nonsmooth problems.

In Sect. 5.4, our algorithm is compared to the state-of-the-art primal dual gradient method (Nesterov 2005a), which employs an excessive gap technique for nonsmooth optimization, updates both the primal and dual variables at each iteration, and has a convergence rate of \(O(1/T)\).

In Sect. 5.5, we test the proposed algorithm for optimizing the problem in (19) with a sparsity constraint on the dual variables.

In Sect. 5.7, we compare the two variants of the proposed method on a data set when \(n\gg d\), and compare Pdprox to the Pegasos algorithm.
5.1 Group lasso regularizer for grouped feature selection
In this experiment we use the group lasso for regularization, i.e., \(R({\mathbf {w}})=\sum _g\sqrt{d_g} \Vert {\mathbf {w}}_g\Vert _2\), where \({\mathbf {w}}_g\) corresponds to the \(g\)th group of variables and \(d_g\) is the number of variables in group \(g\). To apply Nesterov’s method, we can write \(R({\mathbf {w}})=\max _{\Vert {\mathbf {u}}_g\Vert _2\le 1} \sum _g\sqrt{d_g}{\mathbf {w}}_g^{\top }{\mathbf {u}}_g\). We use the MEMset Donar dataset (Yeo and Burge 2003) as the testbed. This dataset was originally used for splice site detection. It is divided into a training set and a test set: the training set consists of 8,415 true and 179,438 false donor sites, and the testing set has 4,208 true and 89,717 false donor sites. Each example in this dataset was originally described by a sequence of {A, C, G, T} of length 7. We follow Yang et al. (2010) and generate group features with up to three-way interactions between the 7 positions, leading to 2,604 attributes in 63 groups. We normalize the length of each example to 1. Following the experimental setup in Yang et al. (2010), we construct a balanced training dataset consisting of all 8,415 true donor sites and 8,415 false donor sites randomly sampled from all 179,438 false sites.
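The group lasso penalty and its proximal mapping (blockwise soft-thresholding, used in the composite gradient step) can be sketched as follows; the group index lists are assumed to partition the coordinates:

```python
import numpy as np

def group_lasso_penalty(w, groups):
    """R(w) = sum_g sqrt(d_g) * ||w_g||_2, where `groups` lists index arrays."""
    return sum(np.sqrt(len(g)) * np.linalg.norm(w[g]) for g in groups)

def prox_group_lasso(w_hat, groups, lam):
    """Blockwise soft-thresholding: the closed-form prox of lam * R(w).
    Each block is shrunk toward zero and zeroed out below the threshold."""
    w = w_hat.copy()
    for g in groups:
        nrm = np.linalg.norm(w_hat[g])
        thr = lam * np.sqrt(len(g))
        w[g] = 0.0 if nrm <= thr else (1.0 - thr / nrm) * w_hat[g]
    return w
```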
5.2 \(\ell _{1,\infty }\) regularization for multitask learning
5.3 Trace norm regularization for max-margin matrix factorization/matrix completion
5.4 Comparison: Pdprox versus primal dual method with excessive gap technique
In this section, we compare the proposed primal dual prox method to Nesterov’s primal dual method (Nesterov 2005a), which is an improvement of his algorithm in Nesterov (2005b). The algorithm in Nesterov (2005b) for nonsmooth optimization suffers from the problem that setting the value of the smoothing parameter requires the number of iterations to be fixed in advance. Nesterov (2005a) addresses the problem by exploiting an excessive gap technique and updating both the primal and dual variables, which is similar to the proposed Pdprox method. We refer to this baseline as Pdexg. We run both algorithms on the same three tasks as in Sects. 5.1, 5.2, and 5.3, i.e., group feature selection with hinge loss and group lasso regularizer on the MEMset Donar data set, multitask learning with \(\epsilon \)-insensitive loss and \(\ell _{1,\infty }\) regularizer on the School data set, and matrix completion with absolute loss and trace norm regularizer on the 100 K MovieLens data set. To implement the primal dual method with the excessive gap technique, we need to intentionally add a domain constraint on the optimal primal variable, which can be derived from the formulation. For example, in the group feature selection problem whose objective is \(1/n\sum _{i=1}^n\ell ({\mathbf {w}}^{\top }{\mathbf {x}}_i, y_i) + \lambda \sum _g \sqrt{d_g}\Vert {\mathbf {w}}_g\Vert _2\), we can derive that the optimal primal variable \({\mathbf {w}}^*\) satisfies \(\Vert {\mathbf {w}}\Vert _2\le \sum _{g}\Vert {\mathbf {w}}_g\Vert _2\le \frac{1}{\lambda \sqrt{d_{\min }}}\), where \(d_{\min }=\min _g d_g\). Similar techniques are applied to multitask learning and matrix completion.
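The domain bound quoted above can be derived by comparing the objective at \({\mathbf {w}}^*\) with the objective at \({\mathbf {w}}={\mathbf {0}}\); a sketch, assuming the hinge loss so that \(\ell (0, y_i)=1\):

```latex
\lambda \sqrt{d_{\min}} \sum_{g}\|\mathbf{w}^*_g\|_2
\;\le\; \lambda \sum_{g}\sqrt{d_g}\,\|\mathbf{w}^*_g\|_2
\;\le\; F(\mathbf{w}^*) \;\le\; F(\mathbf{0})
= \frac{1}{n}\sum_{i=1}^n \ell(0, y_i) = 1
\quad\Longrightarrow\quad
\|\mathbf{w}^*\|_2 \le \sum_{g}\|\mathbf{w}^*_g\|_2 \le \frac{1}{\lambda\sqrt{d_{\min}}}.
```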
The results show that the proposed Pdprox method converges faster than Pdexg on the MEMset Donar data set for group feature selection with hinge loss and group lasso regularizer, and on the 100 K MovieLens data set for matrix completion with absolute loss and trace norm regularizer. However, Pdexg performs better on the School data set for multitask learning with \(\epsilon \)-insensitive loss and \(\ell _{1,\infty }\) regularizer. One interesting phenomenon we can observe from Fig. 5 is that for larger values of \(\lambda \) (e.g., \(10^{-3}\)), the improvement of Pdprox over Pdexg is also larger. The reason is that the proposed Pdprox captures the sparsity of the primal variable at each iteration. This does not hold for Pdexg because it casts the nonsmooth regularizer into a dual form and consequently does not exploit the sparsity of the primal variable at each iteration. Therefore, the larger \(\lambda \) is, the sparser the primal variable at each iteration in Pdprox, which yields a larger improvement over Pdexg. For example, in the group feature selection task with hinge loss and group lasso regularizer, when setting \(\lambda =10^{-3}\), the sparsity of the primal variable (i.e., the proportion of group features with zero norm) in Pdprox averaged over all iterations is 0.7886. However, reducing \(\lambda \) to \(10^{-5}\) reduces the average sparsity of the primal variable in Pdprox to 0. In both settings the average sparsity of the primal variable in Pdexg is 0. The same argument also explains why Pdprox does not perform as well as Pdexg on the School data set when setting \(\lambda =10^{-5}\), since in this case the primal variables in both algorithms are not sparse. When setting \(\lambda =10^{-3}\), the average sparsity (i.e., the proportion of features with zero norm across all tasks) of the primal variable in Pdprox and Pdexg is 0.3766 and 0, respectively.
Finally, we also observe similar performance of the two algorithms on the three tasks with other loss functions including absolute loss for group feature selection, absolute loss for multitask learning, and hinge loss for maxmargin matrix factorization.
5.5 Sparsity constraint on the dual variables
Running time (fourth column) and classification accuracy (fifth column) of Pdprox for (19) and of Liblinear on noisily labeled training data, where noise is added to the labels by random flipping with probability 0.2. We fix \(\lambda =1/n\) in Pdprox and \(C=1\) in Liblinear
Data set  (n, d)/ACC(%)    Alg.               Running time  ACC(%)
a9a       (32561, 123)     Pdprox (m = 200)   0.82s (0.01)  83.44 (0.1)
          85.01            Liblinear          1.15s (0.57)  78.90 (0.4)
rcv1      (20242, 47236)   Pdprox (m = 200)   1.57s (0.23)  94.05 (0.2)
          96.54            Liblinear          3.30s (0.74)  93.66 (0.2)
covtype   (571012, 54)     Pdprox (m = 4000)  48s (3.34)    73.58 (0.01)
          75.80            Liblinear          37s (0.64)    68.66 (0.001)
Finally, we note that choosing a small \(m\) in Eq. (19) is different from simply training a classifier with a small number of examples. For instance, for rcv1, we have run the experiment with 200 training examples, randomly selected from the entire data set. With the same stopping criterion, the testing performance is \(0.8131(\pm 0.05)\), significantly lower than that of optimizing (19) with \(m = 200\).
5.6 Comparison: doubleprimal versus doubledual implementation
From the discussion in Sect. 4.6, we have seen that both the Pdprox-primal and the Pdprox-dual algorithm can be implemented either by maintaining two dual variables, which we refer to as the double-dual implementation, or by maintaining two primal variables, which we refer to as the double-primal implementation. One implementation can be more efficient than the other, depending on the nature of the application. For example, in multitask regression with \(\ell _2\) loss (Nie et al. 2010), if the number of examples is much larger than the number of attributes, i.e., \(n\gg d\), and the number of tasks \(K\) is large, then the dual variable \(\alpha \in {{\mathbb {R}}}^{n\times K}\) is much larger than the primal variable \(W\in {\mathbb {R}}^{d\times K}\), so the double-primal implementation is expected to be more efficient than the double-dual implementation. In contrast, in matrix completion with absolute loss, if the number of observed entries \(\varOmega \), which corresponds to the size of the dual variable, is much smaller than the total number of entries \(n^2\), which corresponds to the size of the primal variable, then the double-dual implementation would be more efficient than the double-primal implementation.
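This size comparison can be phrased as a simple heuristic; the function below is our illustration, not a rule from the paper:

```python
def preferred_implementation(dual_size, primal_size):
    """Heuristic sketch: maintain two copies of whichever variable is smaller.
    E.g. multitask regression with n >> d has dual_size = n*K > primal_size = d*K,
    favoring double-primal; matrix completion with few observed entries has
    dual_size = |Omega| << primal_size = n*n, favoring double-dual."""
    return "double-primal" if dual_size > primal_size else "double-dual"
```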
In the following experiment, we restrict our demonstration to a binary classification problem: given a set of training examples \(({\mathbf {x}}_i,y_i),i=1,\ldots , n\), where \({\mathbf {x}}_i\in {{\mathbb {R}}}^d\), one aims to learn a prediction model \({\mathbf {w}}\in {{\mathbb {R}}}^d.\) We choose the web spam data set ^{6} as the testbed; it contains 350,000 examples, with 16,609,143 trigram features extracted for each example. We use the hinge loss and the \(\ell _2^2\) regularizer with \(\lambda =1/n\), where \(n\) is the number of examples used in the experiment.
5.7 Comparison for solving \(\ell ^2_2\) regularized SVM
In this subsection, we compare the proposed Pdprox method with Pegasos for solving the \(\ell _2^2\) regularized SVM when \(\lambda = O(n^{-1/(1+\epsilon )}), \epsilon \in (0,1]\). We also compare Pdprox using one step size and two step sizes, and compare both to the accelerated version proposed in Chambolle and Pock (2011) for strongly convex functions. We implement the Pdprox-dual algorithm (by the double-dual implementation) in C++ using the same data structures as coded by Shai Shalev-Shwartz ^{7}.
The results demonstrate that (1) the two-step-size scheme with careful tuning of the relative ratio yields better convergence than the one-step-size scheme; (2) Pegasos still remains a state-of-the-art algorithm for solving the \(\ell _2^2\) regularized SVM, but when the problem is relatively difficult, i.e., \(\lambda \) is relatively small (e.g., less than \(1/n\)), the Pdprox algorithm with two step sizes may converge faster in terms of running time; (3) the accelerated version for solving SVM is almost identical to the basic version.
6 Conclusions
In this paper, we study nonsmooth optimization in machine learning where both the loss function and the regularizer are nonsmooth. We develop an efficient gradient based method for a family of nonsmooth optimization problems in which the dual form of the loss function can be expressed as a bilinear function in primal and dual variables. We show that, assuming the proximal step can be efficiently solved, the proposed algorithm achieves a convergence rate of \(O(1/T)\), faster than the \(O(1/\sqrt{T})\) rate suffered by many other first order methods for nonsmooth optimization. In contrast to existing studies on nonsmooth optimization, our work enjoys more simplicity in implementation and analysis, and provides a unified methodology for a diverse set of nonsmooth optimization problems. Our empirical studies demonstrate the efficiency of the proposed algorithm in comparison with the state-of-the-art first order methods for solving many nonsmooth machine learning problems, and the effectiveness of the proposed algorithm for optimizing the problem with a sparsity constraint on the dual variables for tackling noise in the labels. In the future, we plan to adapt the proposed algorithm for stochastic updating and for distributed computing environments.
Footnotes
 1.
The extra gradient mapping on \(\varvec{\beta }\) can also be replaced with a simple calculation, as discussed in Sect. 4.6.
 2.
We compare to the deterministic Pegasos that computes the gradient using all examples at each iteration. One might object that the comparison is unfair since Pegasos is a stochastic algorithm; however, such a comparison (both theoretical and empirical) provides formal evidence that solving the min–max problem by a primal dual method with an extra gradient step may yield better convergence than solving the primal problem.
 8.
We use \(G_{\mathbf {w}}\) and \(G_{\varvec{\alpha }}\) to denote partial gradients.
References
 Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex multi-task feature learning. Machine Learning, 73, 243–272.
 Bach, F., Jenatton, R., & Mairal, J. (2011). Optimization with sparsity-inducing penalties (Foundations and Trends in Machine Learning). Hanover, MA: Now Publishers Inc.
 Bartlett, P. L., & Wegkamp, M. H. (2008). Classification with a reject option using a hinge loss. JMLR, 9, 1823–1840.
 Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2, 183–202.
 Boţ, R. I., Csetnek, E. R., & Heinrich, A. (2012). A primal-dual splitting algorithm for finding zeros of sums of maximally monotone operators. ArXiv e-prints.
 Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3, 1–122.
 Boyd, S., & Vandenberghe, L. (2004). Convex optimization. New York: Cambridge University Press.
 Bredies, K. (2009). A forward-backward splitting algorithm for the minimization of non-smooth convex functionals in Banach space. Inverse Problems, 25, Article ID 015005, p. 20.
 Cai, Y., Sun, Y., Cheng, Y., Li, J., & Goodison, S. (2010). Fast implementation of l1 regularized learning algorithms using gradient descent methods. In SDM, pp. 862–871.
 Candès, E. J., & Recht, B. (2008). Exact matrix completion via convex optimization. CoRR abs/0805.4471.
 Chambolle, A., & Pock, T. (2011). A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40, 120–145.
 Chen, X., Pan, W., Kwok, J. T., & Carbonell, J. G. (2009). Accelerated gradient method for multi-task sparse learning problem. In ICDM, pp. 746–751.
 Combettes, P. L., & Pesquet, J. C. Primal-dual splitting algorithm for solving inclusions with mixtures of composite, Lipschitzian, and parallel-sum monotone operators. http://hal.inria.fr/hal00643381.
 Dekel, O., & Singer, Y. (2006). Support vector machines on a budget. In NIPS, pp. 345–352.
 Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th international conference on machine learning, pp. 272–279.
 Duchi, J., & Singer, Y. (2009). Efficient online and batch learning using forward backward splitting. JMLR, 10, 2899–2934.
 Esser, E., Zhang, X., & Chan, T. F. (2010). A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM Journal on Imaging Sciences, 3, 1015–1046.
 Fung, G., & Mangasarian, O. L. (2002). A feature selection Newton method for support vector machine classification. Technical report, Computational Optimization and Applications.
 Gneiting, T. (2008). Quantiles as optimal point predictors. Technical report, Department of Statistics, University of Washington.
 Hastie, T., Tibshirani, R., & Friedman, J. (2008). The elements of statistical learning: Data mining, inference and prediction. Heidelberg: Springer.
 He, B., & Yuan, X. (2012). Convergence analysis of primal-dual algorithms for a saddle-point problem: From contraction perspective. SIAM Journal on Imaging Sciences, 5, 119–149.
 Hou, C., Nie, F., Yi, D., & Wu, Y. (2011). Feature selection via joint embedding learning and sparse regression. In IJCAI, pp. 1324–1329.
 Hsieh, C. J., Chang, K. W., Lin, C. J., Keerthi, S. S., & Sundararajan, S. (2008). A dual coordinate descent method for large-scale linear SVM. In ICML, pp. 408–415.
 Hu, C., Kwok, J., & Pan, W. (2009). Accelerated gradient methods for stochastic optimization and online learning. In NIPS, pp. 781–789.
 Huang, K., Jin, R., Xu, Z., & Liu, C. L. (2010). Robust metric learning by smooth optimization. In UAI, pp. 244–251.
 Ji, S., & Ye, J. (2009). An accelerated gradient method for trace norm minimization. In ICML, pp. 457–464.
 Joachims, T. (1999). Making large-scale support vector machine learning practical. In Advances in kernel methods: Support vector learning, pp. 169–184.
 Joachims, T. (2006). Training linear SVMs in linear time. In KDD, pp. 217–226.
 Koenker, R. (2005). Quantile regression. Cambridge: Cambridge University Press.
 Lan, G. (2010). An optimal method for stochastic composite optimization. Mathematical Programming, 133, 365–397.
 Lan, G., Lu, Z., & Monteiro, R. D. C. (2011). Primal-dual first-order methods with O(1/epsilon) iteration-complexity for cone programming. Mathematical Programming, 126, 1–29.
 Lin, Q. (2010). A smoothing stochastic gradient method for composite optimization. ArXiv e-prints.
 Lions, P. L., & Mercier, B. (1979). Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16, 964–979.
 Liu, J., Ji, S., & Ye, J. (2009). Multi-task feature learning via efficient l2,1-norm minimization. In UAI, pp. 339–348.
 Mosci, S., Villa, S., Verri, A., & Rosasco, L. (2010). A primal-dual algorithm for group sparse regularization with overlapping groups. In NIPS, pp. 2604–2612.
 Nemirovski, A. (2005). Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15, 229–251.
 Nesterov, Y. (2005a). Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16, 235–249.
 Nesterov, Y. (2005b). Smooth minimization of non-smooth functions. Mathematical Programming, 103, 127–152.
 Nesterov, Y. (2007). Gradient methods for minimizing composite objective function. CORE discussion papers.
 Nie, F., Huang, H., Cai, X., & Ding, C. (2010). Efficient and robust feature selection via joint l2,1-norms minimization. Advances in Neural Information Processing Systems, 23, 1813–1821.
 Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In Advances in kernel methods: Support vector learning, pp. 185–208. Cambridge, MA.
 Pock, T., & Chambolle, A. (2011). Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In Proceedings of the 2011 international conference on computer vision, pp. 1762–1769.
 Popov, L. (1980). A modification of the Arrow-Hurwicz method of search for saddle points. Matematicheskie Zametki, 28, 777–784.
 Quattoni, A., Carreras, X., Collins, M., & Darrell, T. (2009). An efficient projection for l1,infinity regularization. In ICML, pp. 857–864.
 Recht, B., Fazel, M., & Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52, 471–501.
 Rennie, J. D. M., & Srebro, N. (2005). Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd international conference on machine learning, pp. 713–719.
 Rockafellar, R. T. (1976). Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14, 877–898.
 Rosasco, L., De Vito, E., Caponnetto, A., Piana, M., & Verri, A. (2004). Are loss functions all the same? Neural Computation, 16, 1063–1076.
 Shalev-Shwartz, S., Singer, Y., Srebro, N., & Cotter, A. (2011). Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1), 3–30.
 Smale, S., & Zhou, D. X. (2003). Estimating the approximation error in learning theory. Analysis and Applications (Singapore), 1(1), 17–41.
 Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199–222.
 Srebro, N., Rennie, J. D. M., & Jaakkola, T. S. (2005). Maximum-margin matrix factorization. In Advances in neural information processing systems, pp. 1329–1336.
 Sun, L., Liu, J., Chen, J., & Ye, J. (2009). Efficient recovery of jointly sparse vectors. Advances in Neural Information Processing Systems, 22, 1812–1820.
 Traub, J. F., Wasilkowski, G. W., & Woźniakowski, H. (1988). Information-based complexity. San Diego: Academic Press Professional Inc.
 Tseng, P. (2008). On accelerated proximal gradient methods for convex-concave optimization. Technical report.
 Vapnik, V. (1998). Statistical learning theory. New York: Wiley-Interscience.
 Wu, Q., & Zhou, D. X. (2005). SVM soft margin classifiers: Linear programming versus quadratic programming. Neural Computation, 17, 1160–1187.
 Xiao, L. (2009). Dual averaging method for regularized stochastic learning and online optimization. In NIPS, pp. 2116–2124.
 Yang, H., Xu, Z., King, I., & Lyu, M. R. (2010). Online learning for group lasso. In ICML, pp. 1191–1198.
 Yang, T., Mahdavi, M., Jin, R., Zhang, L., & Zhou, Y. (2012). Multiple kernel learning from noisy labels by stochastic programming. In ICML.
 Yeo, G., & Burge, C. B. (2003). Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. In RECOMB, pp. 322–331.
 Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. JRSS, 68, 49–67.
 Zhou, T., Tao, D., & Wu, X. (2010). NESVM: A fast gradient method for support vector machines. CoRR abs/1008.4000.
 Zhou, Y., Jin, R., & Hoi, S. C. (2010). Exclusive lasso for multi-task feature selection. In AISTATS, pp. 988–995.
 Zhu, J., Rosset, S., Hastie, T., & Tibshirani, R. (2003). 1-norm support vector machines. In NIPS.
 Zhu, M., & Chan, T. (2008). An efficient primal-dual hybrid gradient algorithm for total variation image restoration. UCLA CAM Report 08-34.