
Robust multicategory support vector machines using difference convex algorithm

  • Full Length Paper
  • Series B
  • Mathematical Programming

Abstract

The support vector machine (SVM) is one of the most popular classification methods in the machine learning literature. Binary SVM methods have been extensively studied, and have achieved many successes in various disciplines. However, the generalization to multicategory SVM (MSVM) methods can be very challenging. Many existing methods estimate k functions for k classes with an explicit sum-to-zero constraint. It was shown recently that such a formulation can be suboptimal. Moreover, many existing MSVMs are not Fisher consistent, or do not take into account the effect of outliers. In this paper, we focus on classification in the angle-based framework, which is free of the explicit sum-to-zero constraint and hence more efficient, and propose two robust MSVM methods using truncated hinge loss functions. We show that our new classifiers can enjoy Fisher consistency, and simultaneously alleviate the impact of outliers to achieve more stable classification performance. To implement our proposed classifiers, we employ the difference convex algorithm for efficient computation. Theoretical and numerical results indicate that, for problems with potential outliers, our robust angle-based MSVMs can be very competitive among existing methods.
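
To make the computational idea concrete, the following sketch illustrates a generic difference convex algorithm (DCA) iteration on a one-dimensional binary toy problem with a truncated hinge loss written as a difference of two convex functions. It is only a schematic illustration of the DC principle under our own assumptions (the toy data, the truncation point, the ridge penalty, and the grid-search subproblem solver); it is not the paper's multicategory angle-based solver.

```python
# A schematic DCA iteration on a 1-D toy problem with a truncated hinge loss.
# This is NOT the paper's MSVM solver; the truncation point `s`, the toy data,
# and the grid-search subproblem solver are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = np.where(x + 0.3 * rng.normal(size=n) > 0, 1.0, -1.0)
y[:20] *= -1.0                      # inject label noise ("outliers")

lam, s = 0.1, -1.0                  # ridge penalty and truncation point (s <= 0)
# Truncated hinge: min{(1-u)_+, 1-s} = (1-u)_+ - (s-u)_+, a difference of convex functions.

def convex_part(w):
    # g(w): ordinary hinge loss + ridge penalty (convex in w)
    return np.mean(np.maximum(1.0 - y * w * x, 0.0)) + lam * w ** 2

def concave_part_subgrad(w):
    # subgradient of h(w) = mean[(s - y*w*x)_+], the convex function subtracted off
    active = (s - y * w * x) > 0.0
    return np.mean(np.where(active, -y * x, 0.0))

w, grid = 0.0, np.linspace(-5.0, 5.0, 2001)
for _ in range(20):
    d = concave_part_subgrad(w)
    # DCA step: linearize h at the current iterate and minimize the convex
    # surrogate g(w) - d * w (solved here by a simple grid search)
    vals = np.array([convex_part(wg) - d * wg for wg in grid])
    w_new = grid[int(np.argmin(vals))]
    if abs(w_new - w) < 1e-6:
        break
    w = w_new
print("DCA solution for the toy coefficient:", w)
```

At each iteration the concave part is linearized at the current iterate and the resulting convex surrogate is minimized; this is the essential DC structure exploited for the truncated hinge losses studied in the paper.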


References

  1. Arora, S., Bhattacharjee, D., Nasipuri, M., Malik, L., Kundu, M., Basu, D.K.: Performance Comparison of SVM and ANN for Handwritten Devnagari Character Recognition. arXiv preprint arXiv:1006.5902 (2010)

  2. Bache, K., Lichman, M.: UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml (2013)

  3. Bartlett, P.L., Mendelson, S.: Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res. 3, 463–482 (2002)

  4. Bartlett, P.L., Bousquet, O., Mendelson, S.: Local Rademacher complexities. Ann. Stat. 33(4), 1497–1537 (2005)

  5. Bartlett, P.L., Jordan, M.I., McAuliffe, J.D.: Convexity, classification, and risk bounds. J. Am. Stat. Assoc. 101, 138–156 (2006)

  6. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Haussler, D. (ed.) Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, pp. 144–152. Association for Computing Machinery, New York (1992). https://doi.org/10.1145/130385.130401

  7. Caruana, R., Karampatziakis, N., Yessenalina, A.: An empirical evaluation of supervised learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning, pp. 96–103. ACM (2008)

  8. Cortes, C., Vapnik, V.N.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)

  9. Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2, 265–292 (2001)

  10. Cristianini, N., Shawe-Taylor, J.S.: An Introduction to Support Vector Machines, 1st edn. Cambridge University Press, Cambridge (2000)

  11. Demšar, J., Curk, T., Erjavec, A., Gorup, Č., Hočevar, T., Milutinovič, M., Možina, M., Polajnar, M., Toplak, M., Starič, A., Štajdohar, M., Umek, L., Žagar, L., Žbontar, J., Žitnik, M., Zupan, B.: Orange: data mining toolbox in Python. J. Mach. Learn. Res. 14, 2349–2353 (2013). http://jmlr.org/papers/v14/demsar13a.html

  12. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)

  13. Guermeur, Y., Monfrini, E.: A quadratic loss multi-class SVM for which a radius-margin bound applies. Informatica 22(1), 73–96 (2011)

  14. Hastie, T.J., Tibshirani, R.J., Friedman, J.H.: The Elements of Statistical Learning, 2nd edn. Springer, New York (2009)

  15. Hsieh, C., Chang, K., Lin, C., Keerthi, S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pp. 408–415 (2008)

  16. Justino, E.J.R., Bortolozzi, F., Sabourin, R.: A comparison of SVM and HMM classifiers in the off-line signature verification. Pattern Recognit. Lett. 26(9), 1377–1385 (2005)

  17. Kiwiel, K., Rosa, C., Ruszczynski, A.: Proximal decomposition via alternating linearization. SIAM J. Optim. 9(3), 668–689 (1999)

  18. Koltchinskii, V.: Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat. 34(6), 2593–2656 (2006)

  19. Koltchinskii, V., Panchenko, D.: Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Stat. 30(1), 1–50 (2002)

  20. Le Thi, H.A., Pham Dinh, T.: Solving a class of linearly constrained indefinite quadratic problems by DC algorithms. J. Glob. Optim. 11(3), 253–285 (1997)

  21. Le Thi, H.A., Pham Dinh, T.: The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Ann. Oper. Res. 133, 23–46 (2005)

  22. Le Thi, H.A., Pham Dinh, T.: The State of the Art in DC Programming and DCA. Research Report (60 pages), Lorraine University (2013)

  23. Le Thi, H.A., Pham Dinh, T.: Recent advances in DC programming and DCA. Trans. Comput. Collect. Intell. 8342, 1–37 (2014)

  24. Le Thi, H.A., Le, H.M., Pham Dinh, T.: A DC programming approach for feature selection in support vector machines learning. Adv. Data Anal. Classif. 2(3), 259–278 (2008)

  25. Le Thi, H.A., Huynh, V.N., Pham Dinh, T.: DC programming and DCA for general DC programs. Adv. Intell. Syst. Comput., pp. 15–35. ISBN 978-3-319-06568-7 (2014)

  26. Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. J. Am. Stat. Assoc. 99, 67–81 (2004)

  27. Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., Klein, B.: Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. Ann. Stat. 28(6), 1570–1600 (2000)

  28. Lin, X., Pham, M., Ruszczynski, A.: Alternating linearization for structured regularization problems. J. Mach. Learn. Res. 15, 3447–3481 (2014)

  29. Lin, Y.: Some Asymptotic Properties of the Support Vector Machine. Technical Report 1044r, Department of Statistics, University of Wisconsin, Madison (1999)

  30. Liu, Y.: Fisher consistency of multicategory support vector machines. In: Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, pp. 289–296 (2007)

  31. Liu, Y., Shen, X.: Multicategory \(\psi \)-learning. J. Am. Stat. Assoc. 101, 500–509 (2006)

  32. Liu, Y., Yuan, M.: Reinforced multicategory support vector machines. J. Comput. Gr. Stat. 20(4), 901–919 (2011)

  33. Liu, Y., Zhang, H.H., Wu, Y.: Soft or hard classification? Large margin unified machines. J. Am. Stat. Assoc. 106, 166–177 (2011)

  34. McDiarmid, C.: On the method of bounded differences. In: Surveys in Combinatorics, Cambridge University Press, Cambridge, pp. 148–188 (1989)

  35. Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning. MIT Press, Cambridge, MA (2012)

  36. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(4), 341–362 (2012)

  37. Pang, J.S., Razaviyayn, M., Alvarado, A.: Computing B-stationary points of nonsmooth DC programs. Math. Oper. Res. 42, 95–118 (2016)

  38. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B., Burges, J.C., Smola, A.J. (eds.) Advances in Kernel Methods: Support Vector Learning, pp. 185–208. MIT Press, Cambridge, MA, USA (1999)

  39. Shawe-Taylor, J.S., Cristianini, N.: Kernel Methods for Pattern Analysis, 1st edn. Cambridge University Press, Cambridge (2004)

  40. Steinwart, I., Scovel, C.: Fast rates for support vector machines using Gaussian kernels. Ann. Stat. 35(2), 575–607 (2007)

  41. Tseng, P.: A coordinate gradient descent method for linearly constrained smooth optimization and support vector machines training. Comput. Optim. Appl. 47(4), 179–206 (2010)

  42. van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes: With Applications to Statistics, 1st edn. Springer, New York (2000)

  43. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)

  44. Wahba, G.: Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. In: Schölkopf, B., Burges, J.C., Smola, A.J. (eds.) Advances in Kernel Methods: Support Vector Learning, pp. 69–88. MIT Press, Cambridge, MA, USA (1999)

  45. Wang, L., Shen, X.: On \(L_1\)-norm multi-class support vector machines: methodology and theory. J. Am. Stat. Assoc. 102, 595–602 (2007)

  46. Wang, L., Zhu, J., Zou, H.: The doubly regularized support vector machine. Stat. Sin. 16, 589–615 (2006)

  47. Wu, Y., Liu, Y.: On multicategory truncated-hinge-loss support vector machines. In: Prediction and Discovery (AMS-IMS-SIAM Joint Summer Research Conference, Machine and Statistical Learning: Prediction and Discovery, June 25–29, 2006, Snowbird, Utah), Contemporary Mathematics, vol. 443, pp. 49–58. American Mathematical Society, Providence (2007)

  48. Wu, Y., Liu, Y.: Robust truncated hinge loss support vector machines. J. Am. Stat. Assoc. 102(479), 974–983 (2007)

  49. Zhang, C., Liu, Y.: Multicategory angle-based large-margin classification. Biometrika 101(3), 625–640 (2014)

  50. Zhang, C., Liu, Y., Wang, J., Zhu, H.: Reinforced angle-based multicategory support vector machines. J. Comput. Gr. Stat. 25, 806–825 (2016)


Acknowledgements

The authors would like to thank the reviewers and editors for their helpful comments and suggestions, which led to a much improved presentation. Yufeng Liu’s research was supported in part by National Science Foundation Grant IIS1632951 and National Institutes of Health Grant R01GM126550. Chong Zhang’s research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC). Pham’s research was supported in part by National Science Foundation Grant DMS1127914 and the Hobby Postdoctoral Fellowship.

Author information

Corresponding author

Correspondence to Yufeng Liu.

Appendix

Proof of Theorem 4

To prove the theorem, we need the following lemma, whose proof can be found in [49].

Lemma 1

([49], Lemma 1) Suppose we have an arbitrary \(\varvec{f}\in \mathbb {R}^{k-1}\). For any \(u, v\in \{1,\ldots ,k\}\) such that \(u \ne v\), define \(\varvec{T}_{u,v}=\varvec{W}_u-\varvec{W}_v\). For any scalar \(z \in \mathbb {R}\), \(\langle (\varvec{f}+z \varvec{T}_{u,v}),\varvec{W}_w \rangle = \langle \varvec{f},\varvec{W}_w \rangle \), where \(w\in \{1,\ldots ,k\}\) and \(w \ne u, v\). Furthermore, we have that \(\langle (\varvec{f}+z \varvec{T}_{u,v}),\varvec{W}_u \rangle - \langle \varvec{f},\varvec{W}_u\rangle = - \langle (\varvec{f}+z \varvec{T}_{u,v}),\varvec{W}_v\rangle + \langle \varvec{f},\varvec{W}_v\rangle \). In other words, adding \(z \varvec{T}_{u,v}\) to \(\varvec{f}\) changes \(\langle \varvec{f},\varvec{W}_u \rangle \) and \(\langle \varvec{f},\varvec{W}_v \rangle \) by equal amounts of opposite sign, while leaving all other inner products unchanged.
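
Lemma 1 is a purely algebraic property of the simplex vertices \(\varvec{W}_1,\ldots ,\varvec{W}_k\) used in the angle-based framework. The snippet below verifies both identities numerically for a small k, using one standard construction of the vertices (unit norm, pairwise inner products \(-1/(k-1)\), summing to zero); the particular construction formula, the chosen indices, and the tolerance are our own illustrative assumptions.

```python
# Numerical check of Lemma 1 for k = 4, using one standard construction of the
# angle-based simplex vertices W_1, ..., W_k in R^{k-1} (unit norm, pairwise
# inner products -1/(k-1), summing to zero).
import numpy as np

def simplex_vertices(k):
    W = np.zeros((k, k - 1))
    W[0] = np.ones(k - 1) / np.sqrt(k - 1)
    for j in range(1, k):
        W[j] = -(1.0 + np.sqrt(k)) / (k - 1) ** 1.5 * np.ones(k - 1)
        W[j, j - 1] += np.sqrt(k / (k - 1))
    return W

k = 4
W = simplex_vertices(k)
rng = np.random.default_rng(1)
f = rng.normal(size=k - 1)
u, v, w, z = 0, 2, 3, 1.7          # arbitrary u != v, w not in {u, v}, and scalar z
T_uv = W[u] - W[v]

# Inner products with W_w (w != u, v) are unchanged.
assert np.isclose((f + z * T_uv) @ W[w], f @ W[w])
# The change in <., W_u> equals minus the change in <., W_v>.
lhs = (f + z * T_uv) @ W[u] - f @ W[u]
rhs = -((f + z * T_uv) @ W[v]) + f @ W[v]
assert np.isclose(lhs, rhs)
print("Lemma 1 identities verified numerically.")
```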

We first prove that the loss function (3) is Fisher consistent when \(\gamma \le 1/2\) and \(s \le 0\). Without loss of generality, assume that \(P_1 > P_2 \ge \cdots \ge P_k\). The goal is to show that \(\langle \varvec{f}^*, \varvec{W}_1 \rangle > \langle \varvec{f}^*, \varvec{W}_j \rangle \) for any \(j \ne 1\). Inspired by the proofs in [32, 47], we proceed in four steps.

The first step is to show that there exists a theoretical minimizer \(\varvec{f}^*\) such that \(\langle \varvec{f}^*, \varvec{W}_j \rangle \le k-1\) for all j. Suppose instead that \(\langle \varvec{f}^*, \varvec{W}_i \rangle > k-1\) for some i. By the sum-to-zero property of the inner products, \(\sum _{j=1}^k \langle \varvec{f}^*, \varvec{W}_j \rangle = 0\), there must exist \(q \in \{1,\ldots ,k\},~q\ne i\), such that \(\langle \varvec{f}^*, \varvec{W}_q \rangle < -1\). Now one can decrease \(\langle \varvec{f}^*, \varvec{W}_i \rangle \) by a small amount and increase \(\langle \varvec{f}^*, \varvec{W}_q \rangle \) by the same amount (Lemma 1), and the resulting change in the conditional loss depends on s. Specifically, if \(s<-\frac{1}{k-1}\), the conditional loss decreases, which contradicts the definition of \(\varvec{f}^*\). If \(-\frac{1}{k-1}\le s\le 0\), the conditional loss remains unchanged, so one can repeat this adjustment until \(\langle \varvec{f}^*, \varvec{W}_j \rangle \le k-1\) for all j, yielding a minimizer with the desired property.

The second step is to show that \(\langle \varvec{f}^*, \varvec{W}_1 \rangle \ge \langle \varvec{f}^*, \varvec{W}_j \rangle \) for any \(j\ne 1\), by contradiction. Suppose \(\langle \varvec{f}^*, \varvec{W}_1 \rangle < \langle \varvec{f}^*, \varvec{W}_i \rangle \) for some i. The conditional loss \(S(\varvec{x}) = E[ \phi _1\{\varvec{f}(\varvec{X}),Y\} \mid \varvec{X}=\varvec{x}]\) can be written as

$$\begin{aligned} S(\varvec{x}) \triangleq \sum _{j=1}^k P_j h_\gamma (\langle \varvec{f},~\varvec{W}_{j}\rangle )+(1-\gamma )\sum _{j=1}^k R_{s}(\langle \varvec{f},~\varvec{W}_{j}\rangle ), \end{aligned}$$

where \(h_\gamma (u)= \gamma T_{(k-1)s}(u)-(1-\gamma )R_{s}(u)\). Because \(h_\gamma (u)\) is monotonically decreasing for any \(0\le \gamma \le 1\), we have \(h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{1}\rangle )\ge h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{i}\rangle )\). We claim that \(h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{1}\rangle )\le h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{j}\rangle )\) for all \(j\ne 1\). If this were not true, we would have \(h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{1}\rangle )> h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{j}\rangle )\) for some j. Then we could define \(\varvec{f}^{\prime }(\varvec{x}) \in \mathbb {R}^{k-1}\) such that \(\langle \varvec{f}^*, \varvec{W}_1 \rangle = \langle \varvec{f}^{\prime }, \varvec{W}_j \rangle \) and \(\langle \varvec{f}^*, \varvec{W}_j \rangle = \langle \varvec{f}^{\prime }, \varvec{W}_1 \rangle \), with the other inner products unchanged (the existence of such \(\varvec{f}^{\prime }\) is guaranteed by Lemma 1). Because \(P_1 > P_j\), one can verify that \(\varvec{f}^{\prime }\) yields a strictly smaller conditional loss than \(\varvec{f}^*\), which contradicts the definition of \(\varvec{f}^*\). Therefore, \(h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{1}\rangle )=h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{i}\rangle )\). Because \(h_\gamma (\cdot )\) is flat on \((-\infty ,~\min (-1,(k-1)s)]\) and on \([\max (k-1,-s),~+\infty )\), \(\langle \varvec{f}^*,~\varvec{W}_{1}\rangle \) and \(\langle \varvec{f}^*,~\varvec{W}_{i}\rangle \) must lie in the same flat interval. If \(\langle \varvec{f}^*,~\varvec{W}_{1}\rangle ,~\langle \varvec{f}^*,~\varvec{W}_{i}\rangle \in (-\infty ,~\min (-1,(k-1)s)]\), then \(h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{1}\rangle )\) attains the maximum of \(h_\gamma \); combined with the claim above, every \(h_\gamma (\langle \varvec{f}^*,~\varvec{W}_{j}\rangle )\) also attains this maximum, so all \(\langle \varvec{f}^*,~\varvec{W}_{j}\rangle \le \min (-1,(k-1)s)<0\), which contradicts the sum-to-zero property. Thus \(\langle \varvec{f}^*,~\varvec{W}_{1}\rangle ,~\langle \varvec{f}^*,~\varvec{W}_{i}\rangle \in [\max (k-1,-s),~+\infty )\). If \(s<-(k-1)\), then \(\langle \varvec{f}^*,~\varvec{W}_{1}\rangle \ge -s > k-1\), which contradicts the first step. If \(0\ge s\ge -(k-1)\), then, since \(\langle \varvec{f}^*,~\varvec{W}_{j}\rangle \le k-1\) for all j, we have \(\langle \varvec{f}^*,~\varvec{W}_{1}\rangle =\langle \varvec{f}^*,~\varvec{W}_{i}\rangle =k-1\), which contradicts the assumption that \(\langle \varvec{f}^*, \varvec{W}_1 \rangle < \langle \varvec{f}^*, \varvec{W}_i \rangle \). Hence, we must have \(\langle \varvec{f}^*, \varvec{W}_1 \rangle \ge \langle \varvec{f}^*, \varvec{W}_j \rangle \) for all \(j\ne 1\).

The third step is to show that when \(\gamma \le 1/2\), \(\langle \varvec{f}^*, \varvec{W}_j \rangle \ge -1\) for all j. Suppose this is not true, so that \(\langle \varvec{f}^*, \varvec{W}_i \rangle < -1\) for some \(i \ne 1\). Then there must exist \(q \in \{1,\ldots ,k\}, q \ne 1, i\), such that \(-1< \langle \varvec{f}^*, \varvec{W}_q \rangle \le k-1 \). In this case, we can decrease \(\langle \varvec{f}^*, \varvec{W}_q \rangle \) by a small amount and increase \(\langle \varvec{f}^*, \varvec{W}_i \rangle \) by the same amount, so that the conditional loss \(S(\varvec{x})\) decreases, which contradicts the optimality of \(\varvec{f}^*\).

The last step is to show that \(\langle \varvec{f}^*, \varvec{W}_j \rangle \le 0\) for any \(j\ne 1\). Suppose \(\langle \varvec{f}^*, \varvec{W}_i\rangle >0\) for some \(i\ne 1\). We first argue that \(\langle \varvec{f}^*, \varvec{W}_1\rangle < k-1\). Otherwise, \(\langle \varvec{f}^*, \varvec{W}_1 \rangle = k-1\), and by the third step together with the sum-to-zero property we would have \(\langle \varvec{f}^*, \varvec{W}_j\rangle =-1\) for all \(j\ne 1\); in particular, \(\langle \varvec{f}^*, \varvec{W}_i\rangle =-1<0\), which contradicts the assumption \(\langle \varvec{f}^*, \varvec{W}_i\rangle >0\). Hence \(\langle \varvec{f}^*, \varvec{W}_1 \rangle < k-1\). Now we can decrease \(\langle \varvec{f}^*, \varvec{W}_i \rangle \) by a small amount and increase \(\langle \varvec{f}^*, \varvec{W}_1 \rangle \) by the same amount (Lemma 1), and the conditional loss decreases. This indicates that the current \(\varvec{f}^*\) is not optimal, which is a contradiction. Combining this with the sum-to-zero property, we obtain \(\langle \varvec{f}^*, \varvec{W}_1 \rangle >0\ge \langle \varvec{f}^*, \varvec{W}_j \rangle \) for any \(j\ne 1\). This completes the first part of the proof, namely that (3) is Fisher consistent when \(\gamma \le 1/2\) and \(s \le 0\).

We proceed to show the Fisher consistency of (6) when \(s \in [-1/(k-1), 0]\). Again, without loss of generality, assume \(P_1 > P_2 \ge \cdots \ge P_k\). By arguments similar to those above, one can verify that \(\langle \varvec{f}^*, \varvec{W}_1 \rangle \ge 0\), and that \(\langle \varvec{f}^*, \varvec{W}_i \rangle \ge \langle \varvec{f}^*, \varvec{W}_j \rangle \) if \(i < j\). Therefore, it remains to show that \(\varvec{0}\) is not the minimizer for this choice of s. To this end, notice that for \(s \in [-\frac{1}{k-1},0]\), we have \(\frac{1}{-s} \ge k-1\). Consequently, there exists \(t \in (0,1]\) such that \(\frac{t}{-s} \ge k-1\). Consider an \(\varvec{f}\) such that \(\langle \varvec{f}, \varvec{W}_1\rangle =\frac{t(k-1)}{k}\) and \(\langle \varvec{f}, \varvec{W}_j \rangle =-\frac{t}{k}\) for \(j\ne 1\). One can verify that this \(\varvec{f}\) yields a smaller conditional loss than \(\varvec{0}\). Thus, the robust SVM (6) is Fisher consistent.\(\square \)

Proof of Theorem 2

We prove the theorem using a recent technique in the statistical machine learning literature, namely, the Rademacher complexity [3, 4, 18, 19, 35, 39]. To begin with, let \(\sigma = \{\sigma _i; \ i=1,\ldots ,n \}\) be independent and identically distributed random variables that take the values 1 and \(-1\) with probability 1/2 each. Denote by S a sample of observations \((\varvec{x}_i,y_i); \ i=1,\ldots ,n \), drawn independently and identically from the underlying distribution \(P(\varvec{X},Y)\). For a function class \(\mathscr {F}= \{f:f(\varvec{x},y)\}\) and a given S, we define the empirical Rademacher complexity of \(\mathscr {F}\) to be

$$\begin{aligned} \hat{R}_n (\mathscr {F}) = E_{\sigma } \left\{ \sup _{f \in \mathscr {F}} \frac{1}{n} \sum _{i=1}^n \sigma _i f(\varvec{x}_i,y_i)\right\} . \end{aligned}$$

Here \(E_{\sigma }\) means taking expectation with respect to the distribution of \(\sigma \). Furthermore, define the Rademacher complexity of \(\mathscr {F}\) to be

$$\begin{aligned} R_n(\mathscr {F}) = E_{\sigma , S} \left\{ \sup _{f \in \mathscr {F}} \frac{1}{n} \sum _{i=1}^n \sigma _i f(\varvec{x}_i,y_i)\right\} . \end{aligned}$$
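
As a concrete illustration of these definitions, the following sketch estimates the empirical Rademacher complexity of a simple \(L_1\)-constrained linear class by Monte Carlo over the Rademacher draws \(\sigma \), and compares it with a Massart-type bound of the kind appearing in the \(L_1\) case of Lemma 3 below (with \(\kappa =1\)). The class, the data, the dimensions, and the Monte Carlo budget are illustrative assumptions and are not objects from the paper.

```python
# Monte Carlo evaluation of the empirical Rademacher complexity for the linear
# class {x -> gamma^T x : ||gamma||_1 <= s}.  The data and the budget s are
# illustrative assumptions; only the definition of \hat{R}_n is taken from the text.
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 500, 20, 2.0
X = rng.uniform(-1.0, 1.0, size=(n, d))    # features bounded by 1 in absolute value

def sup_over_class(sigma):
    # sup_{||gamma||_1 <= s} (1/n) sum_i sigma_i gamma^T x_i = (s/n) * ||X^T sigma||_inf
    return s * np.max(np.abs(X.T @ sigma)) / n

reps = 2000
draws = [sup_over_class(rng.choice([-1.0, 1.0], size=n)) for _ in range(reps)]
rademacher_hat = float(np.mean(draws))

# Massart-type bound for comparison: s * sqrt(2 log(2d) / n)
bound = s * np.sqrt(2.0 * np.log(2 * d) / n)
print(f"Monte Carlo estimate: {rademacher_hat:.4f}, bound: {bound:.4f}")
```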

Another key step in the proof is to notice that the indicator function in (16) is discontinuous, and thus it is difficult to bound the corresponding Rademacher complexity directly. To overcome this challenge, we consider a continuous upper bound of the indicator function. In particular, for any \(\hat{\varvec{f}}\), let \(I_{\kappa }\) be defined as follows:

$$\begin{aligned} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}), y\} = \left\{ \begin{array}{ll} 1 &{} \text {if } y \ne \hat{y}_{\hat{\varvec{f}}} (\varvec{x}) ,\\ 1 - \frac{1}{\kappa } \min _{j \ne y} \langle \hat{\varvec{f}}(\varvec{x}), \varvec{W}_y-\varvec{W}_j \rangle &{} \text {if } y = \hat{y}_{\hat{\varvec{f}}} (\varvec{x}) \text { and } \\ &{}\quad \min _{j \ne y} \langle \hat{\varvec{f}}(\varvec{x}), \varvec{W}_y-\varvec{W}_j \rangle \le \kappa , \\ 0 &{} \text {otherwise}, \end{array} \right. \end{aligned}$$

where \(\kappa \) is a small positive number to be determined later. One can verify that \(I_{\kappa }\) is a continuous upper bound of the indicator function in (16). In the following proof, we focus on bounding the Rademacher complexity of \(I_{\kappa }\), where \(\hat{\varvec{f}}\) is obtained from the optimization problems (3) and (6).
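
For concreteness, the following sketch is a direct transcription of \(I_{\kappa }\) above into code. The simplex vertices \(\varvec{W}_j\) and the prediction rule \(\hat{y}_{\hat{\varvec{f}}}(\varvec{x}) = \arg \max _j \langle \hat{\varvec{f}}(\varvec{x}), \varvec{W}_j \rangle \) are taken from the angle-based framework; the particular vertex construction, the value of \(\kappa \), and the toy inputs are our own illustrative assumptions.

```python
# A direct transcription of the surrogate indicator I_kappa defined above.
# The simplex vertices W and the angle-based prediction rule argmax_j <f(x), W_j>
# are assumed from the angle-based framework; kappa and the toy inputs are ours.
import numpy as np

def simplex_vertices(k):
    W = np.zeros((k, k - 1))
    W[0] = np.ones(k - 1) / np.sqrt(k - 1)
    for j in range(1, k):
        W[j] = -(1.0 + np.sqrt(k)) / (k - 1) ** 1.5 * np.ones(k - 1)
        W[j, j - 1] += np.sqrt(k / (k - 1))
    return W

def I_kappa(f_val, y, W, kappa):
    """f_val: value of f(x) in R^{k-1}; y: label in {0, ..., k-1} (0-based here)."""
    scores = W @ f_val                          # <f(x), W_j> for each class j
    if int(np.argmax(scores)) != y:             # misclassified: loss 1
        return 1.0
    margin = np.min(scores[y] - np.delete(scores, y))   # min_{j != y} <f(x), W_y - W_j>
    if margin <= kappa:                         # correct but with a small margin
        return 1.0 - margin / kappa
    return 0.0                                  # correct with margin larger than kappa

k, kappa = 3, 0.5
W = simplex_vertices(k)
f_val = 1.2 * W[0]                              # a score vector pointing toward class 0
print([I_kappa(f_val, y, W, kappa) for y in range(k)])   # expected: [0.0, 1.0, 1.0]
```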

Our goal is to show that, with probability at least \(1-\delta \) (\(0< \delta <1\)), \(E [ I_{\kappa } \{\hat{\varvec{f}}(\varvec{X}), Y\} ]\) is bounded by the sum of its empirical counterpart, the Rademacher complexity of the function class \(\mathscr {F}\), and a term depending on \(\delta \). The proof of Theorem 2 consists of two major steps. In particular, we have the following two lemmas.

Lemma 2

Let \(R_n(\mathscr {F})\) and \(\hat{R}_n(\mathscr {F})\) be defined with respect to the \(I_{\kappa }\) function. Then, with probability at least \(1-\delta \),

$$\begin{aligned} E [ I_{\kappa } \{\hat{\varvec{f}}(\varvec{X}), Y\} ] \le \frac{1}{n} \sum _{i=1}^n I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} + 2 R_n (\mathscr {F}) + T_n (\delta ), \end{aligned}$$
(17)

where \(T_n (\delta ) = \{\log (1/\delta ) / n\}^{1/2} \).

Moreover, with probability at least \(1-\delta \),

$$\begin{aligned} E [ I_{\kappa } \{\hat{\varvec{f}}(\varvec{X}), Y\} ] \le \frac{1}{n} \sum _{i=1}^n I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} + 2 \hat{R}_n (\mathscr {F}) + 3T_n (\delta /2). \end{aligned}$$

Lemma 3

Let \(s = 1/\lambda \). In linear learning, when we use the \(L_1\) penalty, the empirical Rademacher complexity \(\hat{R}_n (\mathscr {F}) \le \frac{s}{\kappa } \sqrt{\frac{2\log (2pk-2p)}{n}}\), and when we use the \(L_2\) penalty, \(\hat{R}_n (\mathscr {F}) \le \{2(k-1) (ps)^{1/2}\}/( \kappa n^{1/4}) + \{2 (ps)^{1/2}\} \Big (\log [e+e\{2p(k-1)\} ]/(n^{1/2})\Big )^{1/2}/( \kappa n^{1/4})\). For kernel learning with separable kernel functions, the empirical Rademacher complexity \(\hat{R}_n (\mathscr {F}) \le \frac{s(k-1)}{\kappa \sqrt{n}} \).

Proof of Lemma 2

The proof consists of three parts. In the first part, we use the McDiarmid inequality [34] to bound the left-hand side of (17) in terms of its empirical estimate, plus the expectation of the supremum difference, \(E (\phi )\), defined below. In the second part, we show that \(E (\phi )\) is bounded by the Rademacher complexity using symmetrization inequalities [42]. In the third part, we prove that the Rademacher complexity can be bounded by the empirical Rademacher complexity.

For a given sample S, we define

$$\begin{aligned} \phi (S) = \sup _{\varvec{f}\in \mathscr {F}} \left( E [ I_{ \kappa } \{\hat{\varvec{f}}(\varvec{X}), Y\} ] - \frac{1}{n} \sum _{i=1}^n I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} \right) . \end{aligned}$$

Let \(S^{(i,\varvec{x})} = \{ (\varvec{x}_1,y_1), \ldots , (\varvec{x}_i', y_i), \ldots , (\varvec{x}_n,y_n)\}\) be another sample from \(P(\varvec{X},Y)\), where the difference between S and \(S^{(i,\varvec{x})}\) is only on the \(\varvec{x}\) value of their ith pair. By definition, we have

$$\begin{aligned} |\phi (S) - \phi (S^{(i,\varvec{x})})|&= \left| \sup _{\varvec{f}\in \mathscr {F}} \left( E [ I_{ \kappa } \{\hat{\varvec{f}}(\varvec{X}), Y\} ] - \frac{1}{n} \sum _S I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} \right) \right. \\&\quad - \left. \sup _{\varvec{f}\in \mathscr {F}} \left( E [ I_{ \kappa } \{\hat{\varvec{f}}(\varvec{X}), Y\} ] - \frac{1}{n} \sum _{S^{(i,\varvec{x})}} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} \right) \right| . \end{aligned}$$

For simplicity, suppose that \(\varvec{f}^S\) is the function that achieves the supremum in \(\phi (S)\). The case in which no function achieves the supremum can be treated analogously, using functions whose values come arbitrarily close to the supremum, so we omit the details here. We have that

$$\begin{aligned} |\phi (S) - \phi (S^{(i,\varvec{x})})| \le&\left| E [ I_{ \kappa } \{ \varvec{f}^S (\varvec{X}), Y\}] - \frac{1}{n} \sum _S I_{\kappa } \{ \varvec{f}^S (\varvec{x}_i), y_i\} \right. \\&- \left. E [ I_{ \kappa } \{ \varvec{f}^S (\varvec{X}), Y\} ] + \frac{1}{n} \sum _{S^{(i,\varvec{x})}} I_{\kappa } \{ \varvec{f}^S (\varvec{x}_i), y_i\} \right| . \\ =\,&\frac{1}{n} \left| \sum _S I_{\kappa } \{ \varvec{f}^S (\varvec{x}_i), y_i\} - \sum _{S^{(i,\varvec{x})}} I_{\kappa } \{ \varvec{f}^S (\varvec{x}_i), y_i\} \right| \\ \le \,&\frac{1}{n}. \end{aligned}$$

Next, by the McDiarmid inequality, we have that for any \(t>0\), \(\text {pr}[\phi (S) - E\{\phi (S)\} \ge t] \le \exp [-(2t^2)/\{2n(1/n)^2\}]\), or equivalently, with probability at least \(1-\delta \), \(\phi (S) - E\{\phi (S)\} \le T_n(\delta )\). Consequently, with probability at least \(1-\delta \), \(E [ I_{ \kappa } \{ \hat{\varvec{f}} (\varvec{X}), Y\}] \le n^{-1} \sum _{i=1}^n I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} + E\{\phi (S)\} + T_n(\delta )\). This completes the first part of the proof.
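
For completeness, the equivalence between the tail bound and \(T_n(\delta )\) follows from setting the right-hand side equal to \(\delta \) and solving for t:

$$\begin{aligned} \exp \left[ -\frac{2t^2}{2n(1/n)^2}\right] = \exp (-nt^2) = \delta \quad \Longleftrightarrow \quad t = \left\{ \frac{\log (1/\delta )}{n}\right\} ^{1/2} = T_n (\delta ). \end{aligned}$$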

For the second part, we bound \(E\{\phi (S)\}\) by the corresponding Rademacher complexity. To this end, let \(S^{\prime }= \{(\varvec{x}_i',y_i'); \ i=1,\ldots ,n \}\) be an independent copy of S, that is, an independent sample of size n from the same distribution. Denote by \(E_S\) the expectation with respect to the distribution of S, and define \(E_{S^{\prime }}\) analogously. By definition, we have that \(E_{S^{\prime }} \big [ n^{-1} \sum _{S^{\prime }} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i^{\prime }), y_i^{\prime }\} \mid S \big ] = E [ I_{\kappa } \{\hat{\varvec{f}}(\varvec{X}), Y\} ]\), and \(E_{S^{\prime }} \big [ n^{-1} \sum _{S} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} \mid S \big ] = n^{-1} \sum _{S} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\}\). Then, by Jensen’s inequality and the symmetry of \(\sigma \), we have that

$$\begin{aligned} E\{\phi (S)\}&= E_S \left( \sup _{\varvec{f}\in \mathscr {F}} E_{S^{\prime }} \left[ \frac{1}{n} \sum _{S^{\prime }} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i^{\prime }), y_i^{\prime }\} - \frac{1}{n} \sum _{S} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} \,\Big \vert \, S \right] \right) \\&\le E_{S,S^{\prime }} \left[ \sup _{\varvec{f}\in \mathscr {F}} \frac{1}{n} \sum _{S^{\prime }} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i^{\prime }), y_i^{\prime }\} - \frac{1}{n} \sum _{S} I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} \right] \\&= E_{S,S^{\prime },\sigma } \left[ \sup _{\varvec{f}\in \mathscr {F}} \frac{1}{n} \sum _{S^{\prime }} \sigma _i I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i^{\prime }), y_i^{\prime }\} - \frac{1}{n} \sum _{S} \sigma _i I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} \right] \\&\le 2R_n \{ \mathscr {F}(p,k,s)\}. \end{aligned}$$

Hence the second part is proved.

In the third part, we bound \(R_n (\mathscr {F})\) by \(\hat{R}_n (\mathscr {F})\). This step is analogous to the first part, so we only sketch it: one applies the McDiarmid inequality to \(\hat{R}_n (\mathscr {F})\) and its expectation \(R_n (\mathscr {F})\). Similar to the first part of this proof, we can show that with probability at least \(1-\delta \), \(R_n (\mathscr {F}) \le \hat{R}_n (\mathscr {F}) + 2 T_n(\delta )\).

The final results of Lemma 2 are obtained by choosing the confidence level \(1-\delta /2\) in the first and third parts, and combining the inequalities from the three parts. \(\square \)

Proof of Lemma 3

First, we prove that the obtained \(\hat{\varvec{f}}\) satisfies \(J(\hat{\varvec{f}}) \le s\). To see this, notice that for the zero function, that is, \(\varvec{\beta }_j=\varvec{0}\) and \(\beta _{j,0} = 0\) for all j, we have

$$\begin{aligned} \frac{1}{n} \sum _{i=1}^n I_{\kappa } \{\varvec{f}(\varvec{x}_i), y_i\} \le 1. \end{aligned}$$

On the other hand, \(\hat{\varvec{f}}\) is the solution to the optimization problems in (3) or (6), hence

$$\begin{aligned} \lambda J(\hat{\varvec{f}}) \le \frac{1}{n} \sum _{i=1}^n I_{\kappa } \{\hat{\varvec{f}}(\varvec{x}_i), y_i\} + \lambda J(\hat{\varvec{f}}) \le \frac{1}{n} \sum _{i=1}^n I_{\kappa } \{\varvec{f}(\varvec{x}_i), y_i\}, \end{aligned}$$

which yields \(J(\hat{\varvec{f}}) \le s\).

For the \(L_1\) penalized learning, one can bound the corresponding Rademacher complexity \(\hat{R}_n (\mathscr {F})\) in the following way. In particular, by Lemma 4.2 in [35], we have that \(\hat{R}_n (\mathscr {F})\) is upper bounded by

$$\begin{aligned} \frac{1}{\kappa } \hat{R}_n^* (\mathscr {F}) = \frac{1}{\kappa } E_{\sigma } \left\{ \sup _{ \sum _{j=1}^{k-1} \Vert \varvec{\beta }_j\Vert _1 < s } \frac{1}{n} \sum _{i=1}^n \sigma _i \left\{ \sum _{j=1}^{k-1} \varvec{x}_i^{\text {T}} \varvec{\beta }_j\right\} \right\} , \end{aligned}$$
(18)

because the continuous indicator function is Lipschitz with constant \(1/\kappa \), and the elements of \(\varvec{W}_j\) are bounded by 1. Without loss of generality, we can rewrite (18) as follows:

$$\begin{aligned} \frac{1}{\kappa } \hat{R}_n^* \{ \mathscr {F}(p,k,s) \} = \frac{1}{\kappa } E_{\sigma } \left\{ \sup _{ \Vert \varvec{\gamma }\Vert _1 < s } \frac{1}{n} \sum _{i=1}^n \sigma _i \varvec{\gamma }^{\text {T}} \varvec{x}_i^* \right\} , \end{aligned}$$

where \(\varvec{\gamma }\) can be treated as a vector that contains all the elements in \(\varvec{\beta }_j\) for \(j=1,\ldots , k-1\), and \(\varvec{x}_i^*\) is defined accordingly. Next, by Theorem 10.10 in [35], we have that \(\hat{R}_n^* \{ \mathscr {F}(p,k,s) \} \le s \sqrt{\frac{2\log (2pk-2p)}{n}} \). Thus, \(\hat{R}_n \{ \mathscr {F}(p,k,s)\} \le \frac{s}{\kappa } \sqrt{\frac{2\log (2pk-2p)}{n}} \) for \(L_1\) penalized linear learning.

For \(L_2\) penalized learning, the proof is analogous to that of Lemma 8 in [49], and we omit the details here.

For kernel learning, notice that one can include the intercept in the original predictor space (i.e., augment \(\varvec{x}\) to include a constant 1 before the other predictors), and define a new kernel function accordingly. This new kernel is also positive definite and separable with a bounded kernel function. By Mercer’s Theorem, this introduces a new RKHS \(\mathcal {H}\). Next, by a similar argument as for (18), we have that the original Rademacher complexity is upper bounded by

$$\begin{aligned} \frac{1}{\kappa } \hat{R}_n^* (\mathscr {F})&= \frac{1}{\kappa } E_{\sigma } \left[ \sup _{ \sum _j \Vert f_j\Vert _{\mathcal {H}}^2 \le s } \frac{1}{n} \sum _{i=1}^n \sigma _i \left\{ \sum _{j=1}^{k-1} f_j(\varvec{x}_i) \right\} \right] , \\&\le \frac{k-1}{\kappa } E_{\sigma } \left[ \sup _{ \Vert f \Vert _{\mathcal {H}}^2 \le s } \frac{1}{n} \sum _{i=1}^n \sigma _i f (\varvec{x}_i) \right] , \\&\le \frac{k-1}{\kappa } \frac{s}{\sqrt{n}}, \end{aligned}$$

where the last inequality follows from Theorem 5.5 in [35]. Hence, we have that for kernel learning, \(\hat{R}_n (\mathscr {F}) \le \frac{s(k-1)}{\kappa \sqrt{n}} \). \(\square \)

The proof of Theorem 2 is thus finished by combining Lemmas 2 and 3, and the fact that the continuous indicator function \(I_{\kappa }\) is an upper bound of the indicator function for any \(\kappa \). \(\square \)

Cite this article

Zhang, C., Pham, M., Fu, S. et al. Robust multicategory support vector machines using difference convex algorithm. Math. Program. 169, 277–305 (2018). https://doi.org/10.1007/s10107-017-1209-5
