Fast Projections onto ℓ1,q-Norm Balls for Grouped Feature Selection

  • Suvrit Sra
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6913)


Joint sparsity is widely acknowledged as a powerful structural cue for performing feature selection in setups where variables are expected to exhibit "grouped" behavior. Such grouped behavior is commonly modeled by Group-Lasso or Multitask-Lasso-type problems, where feature selection is effected via ℓ1,q-mixed-norms. Several particular formulations for modeling groupwise sparsity have received substantial attention in the literature, and in some cases efficient algorithms are also available. Surprisingly, for constrained formulations of fundamental importance (e.g., regression with an ℓ1,∞-norm constraint), highly scalable methods seem to be missing. We address this deficiency by presenting a method based on spectral projected gradient (SPG) that can tackle ℓ1,q-constrained convex regression problems. The most crucial component of our method is an algorithm for projecting onto ℓ1,q-norm balls. We present several numerical results showing that our methods attain up to 30× speedups on large ℓ1,∞-multitask lasso problems. Even more dramatic are the gains for the ℓ1,∞-projection subproblem alone: we observe almost three orders of magnitude speedup compared against the currently standard method.
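To give a concrete feel for the projection subproblem the abstract highlights: in the special case of singleton groups, projecting onto an ℓ1,q-norm ball reduces to Euclidean projection onto the ordinary ℓ1 ball. The sketch below is a hedged illustration of that classical sort-based special case, not the paper's ℓ1,q algorithm; the function name and interface are our own.

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto {x : ||x||_1 <= radius},
    using the classical O(n log n) sort-based algorithm."""
    if np.sum(np.abs(v)) <= radius:
        return v.copy()                      # already feasible
    u = np.sort(np.abs(v))[::-1]             # magnitudes, descending
    css = np.cumsum(u)                       # running sums of u
    ks = np.arange(1, len(u) + 1)
    # largest index rho with u[rho] > (css[rho] - radius) / (rho + 1)
    rho = np.nonzero(u > (css - radius) / ks)[0][-1]
    theta = (css[rho] - radius) / (rho + 1)  # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```

For instance, `project_l1_ball(np.array([3.0, 1.0, -2.0]), radius=2.0)` returns a vector whose ℓ1 norm is exactly 2, obtained by soft-thresholding the entries by a common level θ. The full ℓ1,q case treated in the paper replaces the per-coordinate threshold with a groupwise one.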


Feature Selection · Multiple Kernel Learning · Group Lasso · Proximity Operator · Feature Selection Problem



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Suvrit Sra
  1. MPI for Intelligent Systems, Tübingen, Germany
