
Robust multicategory support vector machines using difference convex algorithm

  • Full Length Paper
  • Series B
  • Mathematical Programming

Abstract

The support vector machine (SVM) is one of the most popular classification methods in the machine learning literature. Binary SVM methods have been extensively studied, and have achieved many successes in various disciplines. However, the generalization to multicategory SVM (MSVM) methods can be very challenging. Many existing methods estimate k functions for k classes with an explicit sum-to-zero constraint. It was shown recently that such a formulation can be suboptimal. Moreover, many existing MSVMs are not Fisher consistent, or do not take into account the effect of outliers. In this paper, we focus on classification in the angle-based framework, which is free of the explicit sum-to-zero constraint and hence more efficient, and propose two robust MSVM methods using truncated hinge loss functions. We show that our new classifiers can enjoy Fisher consistency, and simultaneously alleviate the impact of outliers to achieve more stable classification performance. To implement our proposed classifiers, we employ the difference convex algorithm for efficient computation. Theoretical and numerical results indicate that, for problems with potential outliers, our robust angle-based MSVMs can be very competitive among existing methods.
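
To make the computational idea concrete, the following sketch illustrates a generic difference convex algorithm (DCA) iteration on a one-dimensional binary toy problem with a truncated hinge loss written as a difference of two convex functions. It is only a schematic illustration of the DC principle under our own assumptions (the toy data, the truncation point, the ridge penalty, and the grid-search subproblem solver); it is not the paper's multicategory angle-based solver.

```python
# A schematic DCA iteration on a 1-D toy problem with a truncated hinge loss.
# This is NOT the paper's MSVM solver; the truncation point `s`, the toy data,
# and the grid-search subproblem solver are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = np.where(x + 0.3 * rng.normal(size=n) > 0, 1.0, -1.0)
y[:20] *= -1.0                      # inject label noise ("outliers")

lam, s = 0.1, -1.0                  # ridge penalty and truncation point (s <= 0)
# Truncated hinge: min{(1-u)_+, 1-s} = (1-u)_+ - (s-u)_+, a difference of convex functions.

def convex_part(w):
    # g(w): ordinary hinge loss + ridge penalty (convex in w)
    return np.mean(np.maximum(1.0 - y * w * x, 0.0)) + lam * w ** 2

def concave_part_subgrad(w):
    # subgradient of h(w) = mean[(s - y*w*x)_+], the convex function subtracted off
    active = (s - y * w * x) > 0.0
    return np.mean(np.where(active, -y * x, 0.0))

w, grid = 0.0, np.linspace(-5.0, 5.0, 2001)
for _ in range(20):
    d = concave_part_subgrad(w)
    # DCA step: linearize h at the current iterate and minimize the convex
    # surrogate g(w) - d * w (solved here by a simple grid search)
    vals = np.array([convex_part(wg) - d * wg for wg in grid])
    w_new = grid[int(np.argmin(vals))]
    if abs(w_new - w) < 1e-6:
        break
    w = w_new
print("DCA solution for the toy coefficient:", w)
```

At each iteration the concave part is linearized at the current iterate and the resulting convex surrogate is minimized; this is the essential DC structure exploited for the truncated hinge losses studied in the paper.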


References

  1. Arora, S., Bhattacharjee, D., Nasipuri, M., Malik, L., Kundu, M., Basu, D.K.: Performance Comparison of SVM and ANN for Handwritten Devnagari Character Recognition. arXiv preprint arXiv:1006.5902 (2010)

  2. Bache, K., Lichman, M.: UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml (2013)

  3. Bartlett, P.L., Mendelson, S.: Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res. 3, 463–482 (2002)

  4. Bartlett, P.L., Bousquet, O., Mendelson, S.: Local Rademacher complexities. Ann. Stat. 33(4), 1497–1537 (2005)

  5. Bartlett, P.L., Jordan, M.I., McAuliffe, J.D.: Convexity, classification, and risk bounds. J. Am. Stat. Assoc. 101, 138–156 (2006)

  6. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Haussler, D. (ed.) Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, pp. 144–152. Association for Computing Machinery, New York (1992). https://doi.org/10.1145/130385.130401

  7. Caruana, R., Karampatziakis, N., Yessenalina, A.: An empirical evaluation of supervised learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning, pp. 96–103. ACM (2008)

  8. Cortes, C., Vapnik, V.N.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)

  9. Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2, 265–292 (2001)

  10. Cristianini, N., Shawe-Taylor, J.S.: An Introduction to Support Vector Machines, 1st edn. Cambridge University Press, Cambridge (2000)

  11. Demšar, J., Curk, T., Erjavec, A., Gorup, Č., Hočevar, T., Milutinovič, M., Možina, M., Polajnar, M., Toplak, M., Starič, A., Štajdohar, M., Umek, L., Žagar, L., Žbontar, J., Žitnik, M., Zupan, B.: Orange: data mining toolbox in Python. J. Mach. Learn. Res. 14, 2349–2353 (2013). http://jmlr.org/papers/v14/demsar13a.html

  12. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)

  13. Guermeur, Y., Monfrini, E.: A quadratic loss multi-class SVM for which a radius-margin bound applies. Informatica 22(1), 73–96 (2011)

  14. Hastie, T.J., Tibshirani, R.J., Friedman, J.H.: The Elements of Statistical Learning, 2nd edn. Springer, New York (2009)

  15. Hsieh, C., Chang, K., Lin, C., Keerthi, S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pp. 408–415 (2008)

  16. Justino, E.J.R., Bortolozzi, F., Sabourin, R.: A comparison of SVM and HMM classifiers in the off-line signature verification. Pattern Recognit. Lett. 26(9), 1377–1385 (2005)

  17. Kiwiel, K., Rosa, C., Ruszczynski, A.: Proximal decomposition via alternating linearization. SIAM J. Optim. 9(3), 668–689 (1999)

  18. Koltchinskii, V.: Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat. 34(6), 2593–2656 (2006)

  19. Koltchinskii, V., Panchenko, D.: Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Stat. 30(1), 1–50 (2002)

  20. Le Thi, H.A., Pham Dinh, T.: Solving a class of linearly constrained indefinite quadratic problems by DC algorithms. J. Glob. Optim. 11(3), 253–285 (1997)

  21. Le Thi, H.A., Pham Dinh, T.: The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Ann. Oper. Res. 133, 23–46 (2005)

  22. Le Thi, H.A., Pham Dinh, T.: The State of the Art in DC Programming and DCA. Research Report (60 pages), Lorraine University (2013)

  23. Le Thi, H.A., Pham Dinh, T.: Recent advances in DC programming and DCA. Trans. Comput. Collect. Intell. 8342, 1–37 (2014)

  24. Le Thi, H.A., Le, H.M., Pham Dinh, T.: A DC programming approach for feature selection in support vector machines learning. Adv. Data Anal. Classif. 2(3), 259–278 (2008)

  25. Le Thi, H.A., Huynh, V.N., Pham Dinh, T.: DC programming and DCA for general DC programs. Adv. Intell. Syst. Comput., pp. 15–35. ISBN 978-3-319-06568-7 (2014)

  26. Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. J. Am. Stat. Assoc. 99, 67–81 (2004)

  27. Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., Klein, B.: Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. Ann. Stat. 28(6), 1570–1600 (2000)

  28. Lin, X., Pham, M., Ruszczynski, A.: Alternating linearization for structured regularization problems. J. Mach. Learn. Res. 15, 3447–3481 (2014)

  29. Lin, Y.: Some Asymptotic Properties of the Support Vector Machine. Technical Report 1044r, Department of Statistics, University of Wisconsin, Madison (1999)

  30. Liu, Y.: Fisher consistency of multicategory support vector machines. In: Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, pp. 289–296 (2007)

  31. Liu, Y., Shen, X.: Multicategory \(\psi \)-learning. J. Am. Stat. Assoc. 101, 500–509 (2006)

  32. Liu, Y., Yuan, M.: Reinforced multicategory support vector machines. J. Comput. Gr. Stat. 20(4), 901–919 (2011)

  33. Liu, Y., Zhang, H.H., Wu, Y.: Soft or hard classification? Large margin unified machines. J. Am. Stat. Assoc. 106, 166–177 (2011)

  34. McDiarmid, C.: On the method of bounded differences. In: Surveys in Combinatorics, Cambridge University Press, Cambridge, pp. 148–188 (1989)

  35. Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning. MIT Press, Cambridge, MA (2012)

  36. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(4), 341–362 (2012)

  37. Pang, J.S., Razaviyayn, M., Alvarado, A.: Computing B-stationary points of nonsmooth DC programs. Math. Oper. Res. 42, 95–118 (2016)

  38. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B., Burges, J.C., Smola, A.J. (eds.) Advances in Kernel Methods: Support Vector Learning, pp. 185–208. MIT Press, Cambridge, MA, USA (1999)

  39. Shawe-Taylor, J.S., Cristianini, N.: Kernel Methods for Pattern Analysis, 1st edn. Cambridge University Press, Cambridge (2004)

  40. Steinwart, I., Scovel, C.: Fast rates for support vector machines using Gaussian kernels. Ann. Stat. 35(2), 575–607 (2007)

  41. Tseng, P.: A coordinate gradient descent method for linearly constrained smooth optimization and support vector machines training. Comput. Optim. Appl. 47(4), 179–206 (2010)

  42. van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes: With Applications to Statistics, 1st edn. Springer, New York (2000)

  43. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)

  44. Wahba, G.: Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. In: Schölkopf, B., Burges, J.C., Smola, A.J. (eds.) Advances in Kernel Methods: Support Vector Learning, pp. 69–88. MIT Press, Cambridge, MA, USA (1999)

  45. Wang, L., Shen, X.: On \(L_1\)-norm multi-class support vector machines: methodology and theory. J. Am. Stat. Assoc. 102, 595–602 (2007)

  46. Wang, L., Zhu, J., Zou, H.: The doubly regularized support vector machine. Stat. Sin. 16, 589–615 (2006)

  47. Wu, Y., Liu, Y.: On multicategory truncated-hinge-loss support vector machines. In: Prediction and Discovery (AMS-IMS-SIAM Joint Summer Research Conference, Machine and Statistical Learning: Prediction and Discovery, June 25–29, 2006, Snowbird, Utah), Contemporary Mathematics, vol. 443, pp. 49–58. American Mathematical Society, Providence (2007)

  48. Wu, Y., Liu, Y.: Robust truncated hinge loss support vector machines. J. Am. Stat. Assoc. 102(479), 974–983 (2007)

  49. Zhang, C., Liu, Y.: Multicategory angle-based large-margin classification. Biometrika 101(3), 625–640 (2014)

  50. Zhang, C., Liu, Y., Wang, J., Zhu, H.: Reinforced angle-based multicategory support vector machines. J. Comput. Gr. Stat. 25, 806–825 (2016)


Acknowledgements

The authors would like to thank the reviewers and editors for their helpful comments and suggestions, which led to a much improved presentation. Yufeng Liu’s research was supported in part by National Science Foundation Grant IIS1632951 and National Institutes of Health Grant R01GM126550. Chong Zhang’s research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC). Pham’s research was supported in part by National Science Foundation Grant DMS1127914 and the Hobby Postdoctoral Fellowship.

Author information

Corresponding author

Correspondence to Yufeng Liu.

Appendix

Proof of Theorem 4

To prove the theorem, we need the following lemma, whose proof can be found in [49].

Lemma 1

([49], Lemma 1) Suppose we have an arbitrary \(\varvec{f}\in \mathbb {R}^{k-1}\). For any \(u, v\in \{1,\ldots ,k\}\) such that \(u \ne v\), define \(\varvec{T}_{u,v}=\varvec{W}_u-\varvec{W}_v\). For any scalar \(z \in \mathbb {R}\), \(\langle (\varvec{f}+z \varvec{T}_{u,v}),\varvec{W}_w \rangle = \langle \varvec{f},\varvec{W}_w \rangle \), where \(w\in \{1,\ldots ,k\}\) and \(w \ne u, v\). Furthermore, we have that \(\langle (\varvec{f}+z \varvec{T}_{u,v}),\varvec{W}_u \rangle - \langle \varvec{f},\varvec{W}_u\rangle = - \langle (\varvec{f}+z \varvec{T}_{u,v}),\varvec{W}_v\rangle + \langle \varvec{f},\varvec{W}_v\rangle \). In other words, adding \(z \varvec{T}_{u,v}\) to \(\varvec{f}\) changes \(\langle \varvec{f},\varvec{W}_u \rangle \) and \(\langle \varvec{f},\varvec{W}_v \rangle \) by equal amounts of opposite sign, while leaving all other inner products unchanged.
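
Lemma 1 is a purely algebraic property of the simplex vertices \(\varvec{W}_1,\ldots ,\varvec{W}_k\) used in the angle-based framework. The snippet below verifies both identities numerically for a small k, using one standard construction of the vertices (unit norm, pairwise inner products \(-1/(k-1)\), summing to zero); the particular construction formula, the chosen indices, and the tolerance are our own illustrative assumptions.

```python
# Numerical check of Lemma 1 for k = 4, using one standard construction of the
# angle-based simplex vertices W_1, ..., W_k in R^{k-1} (unit norm, pairwise
# inner products -1/(k-1), summing to zero).
import numpy as np

def simplex_vertices(k):
    W = np.zeros((k, k - 1))
    W[0] = np.ones(k - 1) / np.sqrt(k - 1)
    for j in range(1, k):
        W[j] = -(1.0 + np.sqrt(k)) / (k - 1) ** 1.5 * np.ones(k - 1)
        W[j, j - 1] += np.sqrt(k / (k - 1))
    return W

k = 4
W = simplex_vertices(k)
rng = np.random.default_rng(1)
f = rng.normal(size=k - 1)
u, v, w, z = 0, 2, 3, 1.7          # arbitrary u != v, w not in {u, v}, and scalar z
T_uv = W[u] - W[v]

# Inner products with W_w (w != u, v) are unchanged.
assert np.isclose((f + z * T_uv) @ W[w], f @ W[w])
# The change in <., W_u> equals minus the change in <., W_v>.
lhs = (f + z * T_uv) @ W[u] - f @ W[u]
rhs = -((f + z * T_uv) @ W[v]) + f @ W[v]
assert np.isclose(lhs, rhs)
print("Lemma 1 identities verified numerically.")
```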

We first prove that the loss function (3) is Fisher consistent when \(\gamma \le 1/2\) and \(s \le 0\). Without loss of generality, assume that \(P_1 > P_2 \ge \cdots \ge P_k\). The goal is to show that \(\langle \varvec{f}^*, \varvec{W}_1 \rangle > \langle \varvec{f}^*, \varvec{W}_j \rangle \) for any \(j \ne 1\). Inspired by the proofs in [32, 47], we proceed in four steps.

The first step is to show that there exists a theoretical minimizer \(\varvec{f}^*\) such that \(\langle \varvec{f}^*, \varvec{W}_j \rangle \le k-1\) for all j. Suppose instead that \(\langle \varvec{f}^*, \varvec{W}_i \rangle > k-1\) for some i. By the sum-to-zero property of the inner products, \(\sum _{j=1}^k \langle \varvec{f}^*, \varvec{W}_j \rangle = 0\), there must exist \(q \in \{1,\ldots ,k\},~q\ne i\), such that \(\langle \varvec{f}^*, \varvec{W}_q \rangle < -1\). Now one can decrease \(\langle \varvec{f}^*, \varvec{W}_i \rangle \) by a small amount and increase \(\langle \varvec{f}^*, \varvec{W}_q \rangle \) by the same amount (Lemma 1), and the resulting change in the conditional loss depends on s. Specifically, if \(s<-\frac{1}{k-1}\), the conditional loss decreases, which contradicts the definition of \(\varvec{f}^*\). If \(-\frac{1}{k-1}\le s\le 0\), the conditional loss remains unchanged, so one can repeat this adjustment until \(\langle \varvec{f}^*, \varvec{W}_j \rangle \le k-1\) for all j, yielding a minimizer with the desired property.

The second step is to show that \(\langle \varvec{f}^*, \varvec{W}_1 \rangle \ge \langle \varvec{f}^*, \varvec{W}_j \rangle \) for any \(j\ne 1\), by contradiction. Suppose \(\langle \varvec{f}^*, \varvec{W}_1 \rangle < \langle \varvec{f}^*, \varvec{W}_i \rangle \) for some i. The conditional loss \(S(\varvec{x}) = E[ \phi _1\{\varvec{f}(\varvec{X}),Y\} \mid \varvec{X}=\varvec{x}]\) can be written as

$$\begin{aligned} S(\varvec{x}) \triangleq \sum _{j=1}^k P_j h_\gamma (\langle \varvec{f},~\varvec{W}_{j}\rangle )+(1-\gamma )\sum _{j=1}^k R_{s}(\langle \varvec{f},~\varvec{W}_{j}\rangle ), \end{aligned}$$

where \(h_\gamma (u)= \gamma T_{(k-1)s}(u)-(1-\gamma )R_{s}(u)\). Because \(h_\gamma (u)\) is monotonically decreasing for any \(0\le \gamma \le 1\), we have \(h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{1}\rangle )\ge h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{i}\rangle )\). We claim that \(h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{1}\rangle )\le h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{j}\rangle )\) for all \(j\ne 1\). If this were not true, we would have \(h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{1}\rangle )> h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{j}\rangle )\) for some j. Then we could define \(\varvec{f}^{\prime }(\varvec{x}) \in \mathbb {R}^{k-1}\) such that \(\langle \varvec{f}^*, \varvec{W}_1 \rangle = \langle \varvec{f}^{\prime }, \varvec{W}_j \rangle \) and \(\langle \varvec{f}^*, \varvec{W}_j \rangle = \langle \varvec{f}^{\prime }, \varvec{W}_1 \rangle \), with the other inner products unchanged (the existence of such \(\varvec{f}^{\prime }\) is guaranteed by Lemma 1). Because \(P_1 > P_j\), one can verify that \(\varvec{f}^{\prime }\) yields a strictly smaller conditional loss than \(\varvec{f}^*\), which contradicts the definition of \(\varvec{f}^*\). Therefore, \(h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{1}\rangle )=h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{i}\rangle )\). Because \(h_\gamma (\cdot )\) is flat on \((-\infty ,~\min (-1,(k-1)s)]\) and on \([\max (k-1,-s),~+\infty )\), \(\langle \varvec{f}^*,~\varvec{W}_{1}\rangle \) and \(\langle \varvec{f}^*,~\varvec{W}_{i}\rangle \) must lie in the same flat interval. If \(\langle \varvec{f}^*,~\varvec{W}_{1}\rangle ,~\langle \varvec{f}^*,~\varvec{W}_{i}\rangle \in (-\infty ,~\min (-1,(k-1)s)]\), then \(h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{1}\rangle )\) attains the maximum of \(h_\gamma \); combined with the claim above, every \(h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{j}\rangle )\) also attains this maximum, so all \(\langle \varvec{f}^*,~\varvec{W}_{j}\rangle \le \min (-1,(k-1)s)<0\), which contradicts the sum-to-zero property. Thus \(\langle \varvec{f}^*,~\varvec{W}_{1}\rangle ,~\langle \varvec{f}^*,~\varvec{W}_{i}\rangle \in [\max (k-1,-s),~+\infty )\). If \(s<-(k-1)\), then \(\langle \varvec{f}^*,~\varvec{W}_{1}\rangle \ge -s > k-1\), which contradicts the first step. If \(0\ge s\ge -(k-1)\), then, since \(\langle \varvec{f}^*,~\varvec{W}_{j}\rangle \le k-1\) for all j, we have \(\langle \varvec{f}^*,~\varvec{W}_{1}\rangle =\langle \varvec{f}^*,~\varvec{W}_{i}\rangle =k-1\), which contradicts the assumption that \(\langle \varvec{f}^*, \varvec{W}_1 \rangle < \langle \varvec{f}^*, \varvec{W}_i \rangle \). Hence, we must have \(\langle \varvec{f}^*, \varvec{W}_1 \rangle \ge \langle \varvec{f}^*, \varvec{W}_j \rangle \) for all \(j\ne 1\).

The third step is to show that when \(\gamma \le 1/2\), \(\langle \varvec{f}^*, \varvec{W}_j \rangle \ge -1\) for all j. Suppose this is not true, so that \(\langle \varvec{f}^*, \varvec{W}_i \rangle < -1\) for some \(i \ne 1\). Then there must exist \(q \in \{1,\ldots ,k\}, q \ne 1, i\), such that \(-1< \langle \varvec{f}^*, \varvec{W}_q \rangle \le k-1 \). In this case, we can decrease \(\langle \varvec{f}^*, \varvec{W}_q \rangle \) by a small amount and increase \(\langle \varvec{f}^*, \varvec{W}_i \rangle \) by the same amount, so that the conditional loss \(S(\varvec{x})\) decreases, which contradicts the optimality of \(\varvec{f}^*\).

The last step is to show that \(\langle \varvec{f}^*, \varvec{W}_j \rangle \le 0\) for any \(j\ne 1\). Suppose \(\langle \varvec{f}^*, \varvec{W}_i\rangle >0\) for some \(i\ne 1\). We first argue that \(\langle \varvec{f}^*, \varvec{W}_1\rangle < k-1\). Otherwise, \(\langle \varvec{f}^*, \varvec{W}_1 \rangle = k-1\), and by the third step together with the sum-to-zero property we would have \(\langle \varvec{f}^*, \varvec{W}_j\rangle =-1\) for all \(j\ne 1\); in particular, \(\langle \varvec{f}^*, \varvec{W}_i\rangle =-1<0\), which contradicts the assumption \(\langle \varvec{f}^*, \varvec{W}_i\rangle >0\). Hence \(\langle \varvec{f}^*, \varvec{W}_1 \rangle < k-1\). Now we can decrease \(\langle \varvec{f}^*, \varvec{W}_i \rangle \) by a small amount and increase \(\langle \varvec{f}^*, \varvec{W}_1 \rangle \) by the same amount (Lemma 1), and the conditional loss decreases. This indicates that the current \(\varvec{f}^*\) is not optimal, which is a contradiction. Combining this with the sum-to-zero property, we obtain \(\langle \varvec{f}^*, \varvec{W}_1 \rangle >0\ge \langle \varvec{f}^*, \varvec{W}_j \rangle \) for any \(j\ne 1\). This completes the first part of the proof, namely that (3) is Fisher consistent when \(\gamma \le 1/2\) and \(s \le 0\).

We proceed to show the Fisher consistency of (6) when \(s \in [-1/(k-1), 0]\). Again, without loss of generality, assume \(P_1 > P_2 \ge \cdots \ge P_k\). By arguments similar to those above, one can verify that \(\langle \varvec{f}^*, \varvec{W}_1 \rangle \ge 0\), and that \(\langle \varvec{f}^*, \varvec{W}_i \rangle \ge \langle \varvec{f}^*, \varvec{W}_j \rangle \) if \(i < j\). Therefore, it remains to show that \(\varvec{0}\) is not the minimizer for this choice of s. To this end, notice that for \(s \in [-\frac{1}{k-1},0]\), we have \(\frac{1}{-s} \ge k-1\). Consequently, there exists \(t \in (0,1]\) such that \(\frac{t}{-s} \ge k-1\). Consider an \(\varvec{f}\) such that \(\langle \varvec{f}, \varvec{W}_1\rangle =\frac{t(k-1)}{k}\) and \(\langle \varvec{f}, \varvec{W}_j \rangle =-\frac{t}{k}\) for \(j\ne 1\). One can verify that this \(\varvec{f}\) yields a smaller conditional loss than \(\varvec{0}\). Thus, the robust SVM (6) is Fisher consistent.\(\square \)

Proof of Theorem 2

We prove the theorem using a recent technique in the statistical machine learning literature, namely, the Rademacher complexity [3, 4, 18, 19, 35, 39]. To begin with, let \(\sigma = \{\sigma _i; \ i=1,\ldots ,n \}\) be independent and identically distributed random variables that take the values 1 and \(-1\) with probability 1/2 each. Denote by S a sample of observations \((\varvec{x}_i,y_i); \ i=1,\ldots ,n \), drawn independently and identically from the underlying distribution \(P(\varvec{X},Y)\). For a function class \(\mathscr {F}= \{f:f(\varvec{x},y)\}\) and a given S, we define the empirical Rademacher complexity of \(\mathscr {F}\) to be

$$\begin{aligned} \hat{R}_n (\mathscr {F}) = E_{\sigma } \left\{ \sup _{f \in \mathscr {F}} \frac{1}{n} \sum _{i=1}^n \sigma _i f(\varvec{x}_i,y_i)\right\} . \end{aligned}$$

Here \(E_{\sigma }\) means taking expectation with respect to the distribution of \(\sigma \). Furthermore, define the Rademacher complexity of \(\mathscr {F}\) to be

$$\begin{aligned} R_n(\mathscr {F}) = E_{\sigma , S} \left\{ \sup _{f \in \mathscr {F}} \frac{1}{n} \sum _{i=1}^n \sigma _i f(\varvec{x}_i,y_i)\right\} . \end{aligned}$$
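
As a concrete illustration of these definitions, the following sketch estimates the empirical Rademacher complexity of a simple \(L_1\)-constrained linear class by Monte Carlo over the Rademacher draws \(\sigma \), and compares it with a Massart-type bound of the kind appearing in the \(L_1\) case of Lemma 3 below (with \(\kappa =1\)). The class, the data, the dimensions, and the Monte Carlo budget are illustrative assumptions and are not objects from the paper.

```python
# Monte Carlo evaluation of the empirical Rademacher complexity for the linear
# class {x -> gamma^T x : ||gamma||_1 <= s}.  The data and the budget s are
# illustrative assumptions; only the definition of \hat{R}_n is taken from the text.
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 500, 20, 2.0
X = rng.uniform(-1.0, 1.0, size=(n, d))    # features bounded by 1 in absolute value

def sup_over_class(sigma):
    # sup_{||gamma||_1 <= s} (1/n) sum_i sigma_i gamma^T x_i = (s/n) * ||X^T sigma||_inf
    return s * np.max(np.abs(X.T @ sigma)) / n

reps = 2000
draws = [sup_over_class(rng.choice([-1.0, 1.0], size=n)) for _ in range(reps)]
rademacher_hat = float(np.mean(draws))

# Massart-type bound for comparison: s * sqrt(2 log(2d) / n)
bound = s * np.sqrt(2.0 * np.log(2 * d) / n)
print(f"Monte Carlo estimate: {rademacher_hat:.4f}, bound: {bound:.4f}")
```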

Another key step in the proof is to notice that the indicator function in (16) is discontinuous, and thus it is difficult to bound the corresponding Rademacher complexity directly. To overcome this challenge, we consider a continuous upper bound of the indicator function. In particular, for any \(\hat{\varvec{f}}\), let \(I_{\kappa }\) be defined as follows:

$$\begin{aligned} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}), y\} = \left\{ \begin{array}{ll} 1 &{} \text {if } y \ne \hat{y}_{\hat{\varvec{f}}} (\varvec{x}) ,\\ 1 - \frac{1}{\kappa } \min _{j \ne y} \langle \hat{\varvec{f}}(\varvec{x}), \varvec{W}_y-\varvec{W}_j \rangle &{} \text {if } y = \hat{y}_{\hat{\varvec{f}}} (\varvec{x}) \text { and } \\ &{}\quad \min _{j \ne y} \langle \hat{\varvec{f}}(\varvec{x}), \varvec{W}_y-\varvec{W}_j \rangle \le \kappa , \\ 0 &{} \text {otherwise}, \end{array} \right. \end{aligned}$$

where \(\kappa \) is a small positive number to be determined later. One can verify that \(I_{\kappa }\) is a continuous upper bound of the indicator function in (16). In the following proof, we focus on bounding the Rademacher complexity of \(I_{\kappa }\), where \(\hat{\varvec{f}}\) is obtained from the optimization problems (3) and (6).
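
For concreteness, the following sketch is a direct transcription of \(I_{\kappa }\) above into code. The simplex vertices \(\varvec{W}_j\) and the prediction rule \(\hat{y}_{\hat{\varvec{f}}}(\varvec{x}) = \arg \max _j \langle \hat{\varvec{f}}(\varvec{x}), \varvec{W}_j \rangle \) are taken from the angle-based framework; the particular vertex construction, the value of \(\kappa \), and the toy inputs are our own illustrative assumptions.

```python
# A direct transcription of the surrogate indicator I_kappa defined above.
# The simplex vertices W and the angle-based prediction rule argmax_j <f(x), W_j>
# are assumed from the angle-based framework; kappa and the toy inputs are ours.
import numpy as np

def simplex_vertices(k):
    W = np.zeros((k, k - 1))
    W[0] = np.ones(k - 1) / np.sqrt(k - 1)
    for j in range(1, k):
        W[j] = -(1.0 + np.sqrt(k)) / (k - 1) ** 1.5 * np.ones(k - 1)
        W[j, j - 1] += np.sqrt(k / (k - 1))
    return W

def I_kappa(f_val, y, W, kappa):
    """f_val: value of f(x) in R^{k-1}; y: label in {0, ..., k-1} (0-based here)."""
    scores = W @ f_val                          # <f(x), W_j> for each class j
    if int(np.argmax(scores)) != y:             # misclassified: loss 1
        return 1.0
    margin = np.min(scores[y] - np.delete(scores, y))   # min_{j != y} <f(x), W_y - W_j>
    if margin <= kappa:                         # correct but with a small margin
        return 1.0 - margin / kappa
    return 0.0                                  # correct with margin larger than kappa

k, kappa = 3, 0.5
W = simplex_vertices(k)
f_val = 1.2 * W[0]                              # a score vector pointing toward class 0
print([I_kappa(f_val, y, W, kappa) for y in range(k)])   # expected: [0.0, 1.0, 1.0]
```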

Our goal is to show that, with probability at least \(1-\delta \) (\(0< \delta <1\)), \(E [ I_{\kappa } \{\hat{\varvec{f}}(\varvec{X}), Y\} ]\) is bounded by the sum of its empirical counterpart, the Rademacher complexity of the function class \(\mathscr {F}\), and a term depending on \(\delta \). The proof of Theorem 2 consists of two major steps. In particular, we have the following two lemmas.

Lemma 2

Let \(R_n(\mathscr {F})\) and \(\hat{R}_n(\mathscr {F})\) be defined with respect to the \(I_{\kappa }\) function. Then, with probability at least \(1-\delta \),

$$\begin{aligned} E [ I_{\kappa } \{\hat{\varvec{f}}(\varvec{X}), Y\} ] \le \frac{1}{n} \sum _{i=1}^n I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} + 2 R_n (\mathscr {F}) + T_n (\delta ), \end{aligned}$$
(17)

where \(T_n (\delta ) = \{\log (1/\delta ) / n\}^{1/2} \).

Moreover, with probability at least \(1-\delta \),

$$\begin{aligned} E [ I_{\kappa } \{\hat{\varvec{f}}(\varvec{X}), Y\} ] \le \frac{1}{n} \sum _{i=1}^n I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} + 2 \hat{R}_n (\mathscr {F}) + 3T_n (\delta /2). \end{aligned}$$

Lemma 3

Let \(s = 1/\lambda \). In linear learning, when we use the \(L_1\) penalty, the empirical Rademacher complexity \(\hat{R}_n (\mathscr {F}) \le \frac{s}{\kappa } \sqrt{\frac{2\log (2pk-2p)}{n}}\), and when we use the \(L_2\) penalty, \(\hat{R}_n (\mathscr {F}) \le \{2(k-1) (ps)^{1/2}\}/( \kappa n^{1/4}) + \{2 (ps)^{1/2}\} \Big (\log [e+e\{2p(k-1)\} ]/(n^{1/2})\Big )^{1/2}/( \kappa n^{1/4})\). For kernel learning with separable kernel functions, the empirical Rademacher complexity \(\hat{R}_n (\mathscr {F}) \le \frac{s(k-1)}{\kappa \sqrt{n}} \).

Proof of Lemma 2

The proof consists of three parts. In the first part, we use the McDiarmid inequality [34] to bound the left-hand side of (17) in terms of its empirical estimate, plus the expectation of the supremum difference, \(E (\phi )\), defined below. In the second part, we show that \(E (\phi )\) is bounded by the Rademacher complexity using symmetrization inequalities [42]. In the third part, we prove that the Rademacher complexity can be bounded by the empirical Rademacher complexity.

For a given sample S, we define

$$\begin{aligned} \phi (S) = \sup _{\varvec{f}\in \mathscr {F}} \left( E [ I_{ \kappa } \{\hat{\varvec{f}}(\varvec{X}), Y\} ] - \frac{1}{n} \sum _{i=1}^n I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} \right) . \end{aligned}$$

Let \(S^{(i,\varvec{x})} = \{ (\varvec{x}_1,y_1), \ldots , (\varvec{x}_i', y_i), \ldots , (\varvec{x}_n,y_n)\}\) be another sample from \(P(\varvec{X},Y)\), where the difference between S and \(S^{(i,\varvec{x})}\) is only on the \(\varvec{x}\) value of their ith pair. By definition, we have

$$\begin{aligned} |\phi (S) - \phi (S^{(i,\varvec{x})})|&= \left| \sup _{\varvec{f}\in \mathscr {F}} \left( E [ I_{ \kappa } \{\hat{\varvec{f}}(\varvec{X}), Y\} ] - \frac{1}{n} \sum _S I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} \right) \right. \\&\quad - \left. \sup _{\varvec{f}\in \mathscr {F}} \left( E [ I_{ \kappa } \{\hat{\varvec{f}}(\varvec{X}), Y\} ] - \frac{1}{n} \sum _{S^{(i,\varvec{x})}} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} \right) \right| . \end{aligned}$$

For simplicity, suppose that \(\varvec{f}^S\) is the function that achieves the supremum in \(\phi (S)\). The case in which no function achieves the supremum can be treated analogously, using functions whose values come arbitrarily close to the supremum, so we omit the details here. We have that

$$\begin{aligned} |\phi (S) - \phi (S^{(i,\varvec{x})})| \le&\left| E [ I_{ \kappa } \{ \varvec{f}^S (\varvec{X}), Y\}] - \frac{1}{n} \sum _S I_{\kappa } \{ \varvec{f}^S (\varvec{x}_i), y_i\} \right. \\&- \left. E [ I_{ \kappa } \{ \varvec{f}^S (\varvec{X}), Y\} ] + \frac{1}{n} \sum _{S^{(i,\varvec{x})}} I_{\kappa } \{ \varvec{f}^S (\varvec{x}_i), y_i\} \right| . \\ =\,&\frac{1}{n} \left| \sum _S I_{\kappa } \{ \varvec{f}^S (\varvec{x}_i), y_i\} - \sum _{S^{(i,\varvec{x})}} I_{\kappa } \{ \varvec{f}^S (\varvec{x}_i), y_i\} \right| \\ \le \,&\frac{1}{n}. \end{aligned}$$

Next, by the McDiarmid inequality, we have that for any \(t>0\), \(\text {pr}[\phi (S) - E\{\phi (S)\} \ge t] \le \exp [-(2t^2)/\{2n(1/n)^2\}]\), or equivalently, with probability at least \(1-\delta \), \(\phi (S) - E\{\phi (S)\} \le T_n(\delta )\). Consequently, with probability at least \(1-\delta \), \(E [ I_{ \kappa } \{ \hat{\varvec{f}} (\varvec{X}), Y\}] \le n^{-1} \sum _{i=1}^n I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} + E\{\phi (S)\} + T_n(\delta )\). This completes the first part of the proof.
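
For completeness, the equivalence between the tail bound and \(T_n(\delta )\) follows from setting the right-hand side equal to \(\delta \) and solving for t:

$$\begin{aligned} \exp \left[ -\frac{2t^2}{2n(1/n)^2}\right] = \exp (-nt^2) = \delta \quad \Longleftrightarrow \quad t = \left\{ \frac{\log (1/\delta )}{n}\right\} ^{1/2} = T_n (\delta ). \end{aligned}$$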

For the second part, we bound \(E\{\phi (S)\}\) by the corresponding Rademacher complexity. To this end, let \(S^{\prime }= \{(\varvec{x}_i',y_i'); \ i=1,\ldots ,n \}\) be an independent copy of S, that is, an independent sample of size n from the same distribution. Denote by \(E_S\) the expectation with respect to the distribution of S, and define \(E_{S^{\prime }}\) analogously. By definition, we have that \(E_{S^{\prime }} \big [ n^{-1} \sum _{S^{\prime }} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i^{\prime }), y_i^{\prime }\} \mid S \big ] = E [ I_{\kappa } \{\hat{\varvec{f}}(\varvec{X}), Y\} ]\), and \(E_{S^{\prime }} \big [ n^{-1} \sum _{S} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} \mid S \big ] = n^{-1} \sum _{S} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\}\). Then, by Jensen’s inequality and the symmetry of \(\sigma \), we have that

$$\begin{aligned} E\{\phi (S)\}&= E_S \left( \sup _{\varvec{f}\in \mathscr {F}} E_{S^{\prime }} \left[ \frac{1}{n} \sum _{S^{\prime }} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i^{\prime }), y_i^{\prime }\} - \frac{1}{n} \sum _{S} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} \,\Big \vert \, S \right] \right) \\&\le E_{S,S^{\prime }} \left[ \sup _{\varvec{f}\in \mathscr {F}} \frac{1}{n} \sum _{S^{\prime }} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i^{\prime }), y_i^{\prime }\} - \frac{1}{n} \sum _{S} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} \right] \\&= E_{S,S^{\prime },\sigma } \left[ \sup _{\varvec{f}\in \mathscr {F}} \frac{1}{n} \sum _{S^{\prime }} \sigma _i I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i^{\prime }), y_i^{\prime }\} - \frac{1}{n} \sum _{S} \sigma _i I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} \right] \\&\le 2R_n \{ \mathscr {F}(p,k,s)\}. \end{aligned}$$

Hence the second part is proved.

In the third part, we bound \(R_n (\mathscr {F})\) by \(\hat{R}_n (\mathscr {F})\). This step is analogous to the first part, so we only sketch it: one applies the McDiarmid inequality to \(\hat{R}_n (\mathscr {F})\) and its expectation \(R_n (\mathscr {F})\). Similar to the first part of this proof, we can show that with probability at least \(1-\delta \), \(R_n (\mathscr {F}) \le \hat{R}_n (\mathscr {F}) + 2 T_n(\delta )\).

The final results of Lemma 2 are obtained by choosing the confidence level \(1-\delta /2\) in the first and third parts, and combining the inequalities from the three parts. \(\square \)

Proof of Lemma 3

First, we prove that the obtained \(\hat{\varvec{f}}\) satisfies \(J(\hat{\varvec{f}}) \le s\). To see this, notice that for the zero function, that is, \(\varvec{\beta }_j=\varvec{0}\) and \(\beta _{j,0} = 0\) for all j, we have

$$\begin{aligned} \frac{1}{n} \sum _{i=1}^n I_{\kappa } \{\varvec{f}(\varvec{x}_i), y_i\} \le 1. \end{aligned}$$

On the other hand, \(\hat{\varvec{f}}\) is the solution to the optimization problems in (3) or (6), hence

$$\begin{aligned} \lambda J(\hat{\varvec{f}}) \le \frac{1}{n} \sum _{i=1}^n I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} + \lambda J(\hat{\varvec{f}}) \le \frac{1}{n} \sum _{i=1}^n I_{\kappa } \{\varvec{f}(\varvec{x}_i), y_i\}, \end{aligned}$$

which yields \(J(\hat{\varvec{f}}) \le s\).

For the \(L_1\) penalized learning, one can bound the corresponding Rademacher complexity \(\hat{R}_n (\mathscr {F})\) in the following way. In particular, by Lemma 4.2 in [35], we have that \(\hat{R}_n (\mathscr {F})\) is upper bounded by

$$\begin{aligned} \frac{1}{\kappa } \hat{R}_n^* (\mathscr {F}) = \frac{1}{\kappa } E_{\sigma } \left\{ \sup _{ \sum _{j=1}^{k-1} \Vert \varvec{\beta }_j\Vert _1 < s } \frac{1}{n} \sum _{i=1}^n \sigma _i \left\{ \sum _{j=1}^{k-1} \varvec{x}_i^{\text {T}} \varvec{\beta }_j\right\} \right\} , \end{aligned}$$
(18)

because the continuous indicator function is Lipschitz with constant \(1/\kappa \), and the elements of \(\varvec{W}_j\) are bounded by 1. Without loss of generality, we can rewrite (18) as follows:

$$\begin{aligned} \frac{1}{\kappa } \hat{R}_n^* \{ \mathscr {F}(p,k,s) \} = \frac{1}{\kappa } E_{\sigma } \left\{ \sup _{ \Vert \varvec{\gamma }\Vert _1 < s } \frac{1}{n} \sum _{i=1}^n \sigma _i \varvec{\gamma }^{\text {T}} \varvec{x}_i^* \right\} , \end{aligned}$$

where \(\varvec{\gamma }\) can be treated as a vector that contains all the elements in \(\varvec{\beta }_j\) for \(j=1,\ldots , k-1\), and \(\varvec{x}_i^*\) is defined accordingly. Next, by Theorem 10.10 in [35], we have that \(\hat{R}_n^* \{ \mathscr {F}(p,k,s) \} \le s \sqrt{\frac{2\log (2pk-2p)}{n}} \). Thus, \(\hat{R}_n \{ \mathscr {F}(p,k,s)\} \le \frac{s}{\kappa } \sqrt{\frac{2\log (2pk-2p)}{n}} \) for \(L_1\) penalized linear learning.

For \(L_2\) penalized learning, the proof is analogous to that of Lemma 8 in [49], and we omit the details here.

For kernel learning, notice that one can include the intercept in the original predictor space (i.e., augment \(\varvec{x}\) to include a constant 1 before the other predictors), and define a new kernel function accordingly. This new kernel is also positive definite and separable with a bounded kernel function. By Mercer’s Theorem, this introduces a new RKHS \(\mathcal {H}\). Next, by a similar argument as for (18), we have that the original Rademacher complexity is upper bounded by

$$\begin{aligned} \frac{1}{\kappa } \hat{R}_n^* (\mathscr {F})&= \frac{1}{\kappa } E_{\sigma } \left[ \sup _{ \sum _j \Vert f_j\Vert _{\mathcal {H}}^2 \le s } \frac{1}{n} \sum _{i=1}^n \sigma _i \left\{ \sum _{j=1}^{k-1} f_j(\varvec{x}_i) \right\} \right] , \\&\le \frac{k-1}{\kappa } E_{\sigma } \left[ \sup _{ \Vert f \Vert _{\mathcal {H}}^2 \le s } \frac{1}{n} \sum _{i=1}^n \sigma _i f (\varvec{x}_i) \right] , \\&\le \frac{k-1}{\kappa } \frac{s}{\sqrt{n}}, \end{aligned}$$

where the last inequality follows from Theorem 5.5 in [35]. Hence, we have that for kernel learning, \(\hat{R}_n (\mathscr {F}) \le \frac{s(k-1)}{\kappa \sqrt{n}} \). \(\square \)

The proof of Theorem 2 is thus finished by combining Lemmas 2 and 3, and the fact that the continuous indicator function \(I_{\kappa }\) is an upper bound of the indicator function for any \(\kappa \). \(\square \)

Cite this article

Zhang, C., Pham, M., Fu, S. et al. Robust multicategory support vector machines using difference convex algorithm. Math. Program. 169, 277–305 (2018). https://doi.org/10.1007/s10107-017-1209-5
