Abstract
We analyze a batched variant of stochastic gradient descent (SGD) with a weighted sampling distribution for smooth and non-smooth objective functions. We show that by distributing the batches computationally, a significant speedup in the convergence rate is provably possible compared to either batched or weighted sampling alone. We propose several computationally efficient schemes to approximate the optimal weights, and compute the proposed sampling distributions explicitly for the least squares and hinge loss problems. We show both analytically and experimentally that substantial gains can be obtained.
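To make the setup concrete, here is a minimal sketch of mini-batch SGD with weighted sampling for the least squares problem. It samples rows with probability proportional to their squared norms — a common surrogate for the optimal Lipschitz-based weights, not necessarily the exact scheme analyzed in the paper — and the function name `batched_weighted_sgd` is our own illustrative choice.

```python
import numpy as np

def batched_weighted_sgd(A, b, batch_size=5, n_iters=500, seed=0):
    """Mini-batch SGD for f(x) = (1/2n) * ||Ax - b||^2, drawing rows
    with probability proportional to their squared norms (a stand-in
    for the optimal per-row Lipschitz weights)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    row_sq = np.sum(A * A, axis=1)
    p = row_sq / row_sq.sum()          # importance-sampling distribution
    # With these weights every reweighted component shares the same
    # smoothness constant ||A||_F^2 / n, which fixes a safe step size.
    step = n / (2.0 * row_sq.sum())
    x = np.zeros(d)
    for _ in range(n_iters):
        idx = rng.choice(n, size=batch_size, p=p)
        residual = A[idx] @ x - b[idx]
        # Unbiased gradient estimate: reweight each sampled row by 1/(n * p_i).
        grad = (A[idx] * (residual / (n * p[idx]))[:, None]).mean(axis=0)
        x -= step * grad
    return x
```

On a consistent system (b in the range of A), the per-sample gradient noise vanishes at the solution, so this constant-step scheme converges linearly; the batch averaging further reduces the variance of each update.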
Acknowledgements
The authors would like to thank Anna Ma for helpful discussions about this paper, and the reviewers for their thoughtful feedback. Needell was partially supported by NSF CAREER grant #1348721 and the Alfred P. Sloan Foundation. Ward was partially supported by NSF CAREER grant #1255631.
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Needell, D., Ward, R. (2017). Batched Stochastic Gradient Descent with Weighted Sampling. In: Fasshauer, G., Schumaker, L. (eds) Approximation Theory XV: San Antonio 2016. AT 2016. Springer Proceedings in Mathematics & Statistics, vol 201. Springer, Cham. https://doi.org/10.1007/978-3-319-59912-0_14
Print ISBN: 978-3-319-59911-3
Online ISBN: 978-3-319-59912-0