
Advances in Computational Mathematics, Volume 45, Issue 5–6, pp 2745–2770

Fast and strong convergence of online learning algorithms

  • Zheng-Chu Guo
  • Lei Shi

Article

Abstract

In this paper, we study an online learning algorithm without explicit regularization terms. The algorithm is essentially a stochastic gradient descent scheme in a reproducing kernel Hilbert space (RKHS). The polynomially decaying step size in each iteration plays the role of regularization and ensures the generalization ability of the algorithm. We develop a novel capacity-dependent analysis of the performance of the last iterate of the online learning algorithm, which answers an open problem in learning theory. The contribution of this paper is twofold. First, our capacity-dependent analysis leads to a sharp convergence rate in the standard mean square distance, improving the results in the literature. Second, we establish, for the first time, strong convergence of the last iterate with polynomially decaying step sizes in the RKHS norm. We demonstrate that the theoretical analysis established in this paper fully exploits the fine structure of the underlying RKHS and thus leads to sharp error estimates for the online learning algorithm.
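The abstract does not spell out the iteration, but in this literature the unregularized online scheme for least squares takes the form f_{t+1} = f_t - eta_t (f_t(x_t) - y_t) K_{x_t}, with polynomially decaying step size eta_t = eta_1 t^{-theta}. The following is a minimal Python sketch under that assumption; the Gaussian kernel, the parameter values, and all names (online_kernel_sgd, eta1, theta) are illustrative choices, not the paper's specification.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=0.5):
    # Gaussian (RBF) kernel; any Mercer kernel inducing the RKHS would do.
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def online_kernel_sgd(X, y, eta1=0.5, theta=0.5, kernel=gaussian_kernel):
    # Unregularized online least-squares SGD in the RKHS:
    #   f_{t+1} = f_t - eta_t * (f_t(x_t) - y_t) * K(x_t, .),
    # with polynomially decaying step size eta_t = eta1 * t**(-theta).
    # The iterate is stored via its kernel-expansion coefficients.
    coeffs, centers = [], []
    for t, (x_t, y_t) in enumerate(zip(X, y), start=1):
        # evaluate the current iterate f_t at the new sample x_t
        f_xt = sum(a * kernel(c, x_t) for a, c in zip(coeffs, centers))
        # the decaying step size is the only source of regularization here
        eta_t = eta1 * t ** (-theta)
        coeffs.append(-eta_t * (f_xt - y_t))
        centers.append(x_t)
    # return the last iterate f_T as a callable function
    return lambda x: sum(a * kernel(c, x) for a, c in zip(coeffs, centers))

# Toy usage: learn a noisy sine curve from a stream of 200 samples.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
f_last = online_kernel_sgd(X, y)
print(f_last(np.array([0.25])))  # roughly sin(pi/2) = 1
```

Because no explicit regularization term appears, the decay exponent theta is the only knob controlling the bias-variance trade-off, and the object of the paper's analysis is the last iterate f_T returned above, in both the mean square distance and the RKHS norm.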

Keywords

Learning theory · Online learning · Capacity-dependent error analysis · Strong convergence in an RKHS

Mathematics Subject Classification (2010)

68Q32 · 68T05 · 62J02 · 62L20


Notes

Funding information

The work described in this paper is supported partially by the National Natural Science Foundation of China (Grant Nos. 11401524, 11531013, 11571078, and 11631015). Lei Shi is also sponsored by the Program of Shanghai Subject Chief Scientist (Project No. 18XD1400700) and by the NSFC/RGC Joint Research Fund (Project No. 11461161006 and Project No. CityU 104012).


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. School of Mathematical Sciences, Zhejiang University, Hangzhou, People’s Republic of China
  2. Shanghai Key Laboratory for Contemporary Applied Mathematics, School of Mathematical Sciences, Fudan University, Shanghai, People’s Republic of China
