Machine Learning, Volume 98, Issue 3, pp 369–406

An efficient primal dual prox method for non-smooth optimization

  • Tianbao Yang
  • Mehrdad Mahdavi
  • Rong Jin
  • Shenghuo Zhu


We study non-smooth optimization problems in machine learning, where both the loss function and the regularizer are non-smooth. Previous studies on efficient empirical loss minimization assume either a smooth loss function or a strongly convex regularizer, making them unsuitable for non-smooth optimization. We develop a simple yet efficient method for a family of non-smooth optimization problems in which the dual form of the loss function is bilinear in the primal and dual variables. We cast the non-smooth optimization problem into a minimax problem and develop a primal dual prox method that solves the minimax problem at a rate of \(O(1/T)\), assuming that the proximal step can be solved efficiently; this is significantly faster than a standard subgradient descent method, whose convergence rate is \(O(1/\sqrt{T})\). Our empirical studies verify the efficiency of the proposed method on various non-smooth optimization problems that arise ubiquitously in machine learning, comparing it to state-of-the-art first-order methods.
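
To make the minimax reformulation concrete, below is a minimal sketch (in Python/NumPy) of a generic first-order primal-dual prox iteration applied to one problem of the kind described in the abstract: L1-regularized hinge-loss minimization, where the hinge loss admits the bilinear dual form \(\max_{\alpha\in[0,1]} \alpha(1 - y\,\mathbf{x}^\top\mathbf{w})\). The step sizes, extrapolation, and iterate averaging used here are standard choices for this class of first-order primal-dual methods and are assumptions for illustration, not the paper's exact update rules.

```python
# Illustrative sketch (not the authors' exact algorithm): a primal-dual prox
# iteration for the saddle-point form of L1-regularized hinge-loss minimization,
#   min_w max_{a in [0,1]^n}  (1/n) a^T (1 - y * (X @ w)) + lam * ||w||_1 .
import numpy as np

def soft_threshold(v, thresh):
    """Proximal operator of thresh * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def primal_dual_prox(X, y, lam, T=1000):
    n, d = X.shape
    K = (y[:, None] * X) / n                 # bilinear coupling matrix
    L = np.linalg.norm(K, 2)                 # spectral norm bounds the coupling
    tau = sigma = 1.0 / max(L, 1e-12)        # step sizes with tau * sigma * L^2 <= 1
    w, a = np.zeros(d), np.zeros(n)
    w_bar, w_avg = w.copy(), np.zeros(d)
    for t in range(T):
        # dual ascent step on a, then projection onto the box [0, 1]^n
        a = np.clip(a + sigma * (1.0 / n - K @ w_bar), 0.0, 1.0)
        # primal descent step on w, then prox of the L1 regularizer
        w_new = soft_threshold(w + tau * (K.T @ a), tau * lam)
        # extrapolation and averaging (averaged iterates carry the O(1/T) guarantee)
        w_bar = 2 * w_new - w
        w = w_new
        w_avg += w / T
    return w_avg

# toy usage on synthetic sparse data
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
w_true = np.zeros(50); w_true[:5] = 1.0
y = np.sign(X @ w_true)
w_hat = primal_dual_prox(X, y, lam=0.05)
```

The proximal step here is a simple soft-thresholding because the regularizer is the L1 norm; for other non-smooth regularizers the same template applies whenever their proximal mapping can be computed efficiently.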


Keywords: Non-smooth optimization · Primal dual method · Convergence rate · Sparsity · Efficiency



Copyright information

© The Author(s) 2014

Authors and Affiliations

  • Tianbao Yang (1)
  • Mehrdad Mahdavi (2)
  • Rong Jin (2)
  • Shenghuo Zhu (1)
  1. NEC Laboratories America, Inc., Cupertino, USA
  2. Department of Computer Science and Engineering, Michigan State University, East Lansing, USA
