Machine Learning, Volume 108, Issue 11, pp 1951–1974

Boosting as a kernel-based method

  • Aleksandr Y. Aravkin
  • Giulio Bottegal
  • Gianluigi Pillonetto


Boosting combines weak (biased) learners to obtain effective learning algorithms for classification and prediction. In this paper, we show a connection between boosting and kernel-based methods, highlighting both theoretical and practical applications. In the \(\ell _2\) context, we show that boosting with a weak learner defined by a kernel K is equivalent to estimation with a special boosting kernel. The number of boosting iterations can then be modeled as a continuous hyperparameter, and fit (along with other parameters) using standard techniques. We then generalize the boosting kernel to a broad new class of boosting approaches for general weak learners, including those based on the \(\ell _1\), hinge and Vapnik losses. We develop fast hyperparameter tuning for this class, which has a wide range of applications including robust regression and classification. We illustrate several applications using synthetic and real data.
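The \(\ell _2\) equivalence described above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the weak learner is kernel ridge regression with a fixed regularization weight `lam`, and it uses a closed form derived from the standard \(\ell _2\) boosting residual recursion, so the notation may differ from the paper's. The function names, the Gaussian kernel width, and the synthetic data are all illustrative assumptions. Under these assumptions, \(\nu\) boosting rounds give the same fit as a single kernel estimate with kernel matrix \(P = \lambda [(I + K/\lambda )^{\nu } - I]\), and that expression stays well defined for non-integer \(\nu\).

```python
import numpy as np

def gaussian_kernel(x, width=0.5):
    """Gaussian kernel matrix on 1-D inputs (the width is an illustrative choice)."""
    d = x[:, None] - x[None, :]
    return np.exp(-d**2 / (2.0 * width**2))

def l2_boost(K, y, lam, nu):
    """Run nu rounds of L2 boosting with a kernel ridge weak learner."""
    n = len(y)
    S = K @ np.linalg.inv(K + lam * np.eye(n))  # weak-learner smoother matrix
    fit = np.zeros(n)
    for _ in range(nu):
        fit = fit + S @ (y - fit)  # fit the current residual and accumulate
    return fit

def boosting_kernel(K, lam, nu):
    """Assumed closed form P = lam * ((I + K/lam)^nu - I); nu may be non-integer."""
    d, U = np.linalg.eigh(K)
    d = np.clip(d, 0.0, None)  # guard against tiny negative round-off eigenvalues
    p = lam * ((1.0 + d / lam) ** nu - 1.0)
    return (U * p) @ U.T

def kernel_fit(P, y, lam):
    """Regularized kernel estimate at the training inputs: P (P + lam I)^{-1} y."""
    return P @ np.linalg.solve(P + lam * np.eye(len(y)), y)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 20)
y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(x.size)
K = gaussian_kernel(x)

boosted = l2_boost(K, y, lam=1.0, nu=5)
direct = kernel_fit(boosting_kernel(K, lam=1.0, nu=5), y, lam=1.0)
print(np.max(np.abs(boosted - direct)))  # expected to be at round-off level

# A fractional number of "iterations" has no iterative counterpart,
# but the kernel form still yields an estimate.
fractional = kernel_fit(boosting_kernel(K, lam=1.0, nu=2.5), y, lam=1.0)
```

Because `nu` enters only through a power of the eigenvalues, it can be tuned by the same continuous search used for `lam` or the kernel width, which is the sense in which the iteration count becomes a continuous hyperparameter.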


Keywords: Boosting, Weak learners, Kernel-based methods, Reproducing kernel Hilbert spaces, Robust estimation



Funding was provided by Washington Research Foundation.



Copyright information

© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Applied Mathematics, University of Washington, Seattle, USA
  2. Department of Electrical Engineering, TU Eindhoven, Eindhoven, The Netherlands
  3. Department of Information Engineering, University of Padova, Padua, Italy
