
SAAGs: Biased stochastic variance reduction methods for large-scale learning

  • Vinod Kumar Chauhan
  • Anuj Sharma
  • Kalpana Dahiya

Abstract

Stochastic approximation is one of the effective approaches to deal with large-scale machine learning problems, and recent research has focused on reducing the variance caused by the noisy approximations of the gradients. In this paper, we propose novel variants of SAAG-I and II (Stochastic Average Adjusted Gradient) (Chauhan et al. 2017), called SAAG-III and IV, respectively. Unlike SAAG-I, in SAAG-III the starting point is set to the average of the previous epoch; unlike SAAG-II, in SAAG-IV the snap point and the starting point are set to the average and the last iterate of the previous epoch, respectively. To determine the step size, we use Stochastic Backtracking-Armijo line Search (SBAS), which performs line search only on a selected mini-batch of data points. Since backtracking line search over the full data set is not suitable for large-scale problems, and the constants used to find the step size, such as the Lipschitz constant, are not always available, SBAS can be very effective in such cases. We extend SAAGs (I, II, III and IV) to solve non-smooth problems and design two update rules, one for smooth and one for non-smooth problems. Moreover, our theoretical results prove linear convergence of SAAG-IV, in expectation, for all four combinations of smoothness and strong convexity. Finally, our experimental studies demonstrate the efficacy of the proposed methods against state-of-the-art techniques.
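To make the step-size selection idea concrete, the following is a minimal sketch of a backtracking-Armijo line search restricted to a sampled mini-batch, in the spirit of the SBAS procedure described above. The function and parameter names (the loss/gradient callables, eta0, beta, c, max_trials) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def sbas_step_size(w, X_batch, y_batch, loss_fn, grad_fn,
                   eta0=1.0, beta=0.5, c=1e-4, max_trials=50):
    """Backtracking-Armijo line search evaluated only on one mini-batch.

    loss_fn(w, X, y) and grad_fn(w, X, y) are user-supplied callables; no
    Lipschitz constant or full pass over the data set is required.
    """
    g = grad_fn(w, X_batch, y_batch)       # mini-batch gradient
    f0 = loss_fn(w, X_batch, y_batch)      # mini-batch objective value
    d = -g                                 # steepest-descent direction
    slope = float(g @ d)                   # directional derivative = -||g||^2
    eta = eta0
    for _ in range(max_trials):
        # Armijo sufficient-decrease test on the mini-batch objective
        if loss_fn(w + eta * d, X_batch, y_batch) <= f0 + c * eta * slope:
            break
        eta *= beta                        # shrink the trial step and retry
    return eta

# Example usage with a hypothetical regularized least-squares mini-batch loss.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(32, 10)), rng.normal(size=32)
    lam = 0.01
    loss = lambda w, X, y: 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * w @ w
    grad = lambda w, X, y: X.T @ (X @ w - y) / len(y) + lam * w
    w = np.zeros(10)
    eta = sbas_step_size(w, X, y, loss, grad)
    w -= eta * grad(w, X, y)               # one stochastic update with the found step
    print(f"step size {eta:.4f}, new mini-batch loss {loss(w, X, y):.4f}")
```

Because both the objective value and the sufficient-decrease test use only the current mini-batch, the cost of the search stays independent of the data-set size, which is the motivation the abstract gives for preferring SBAS over a full backtracking line search.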

Keywords

Stochastic gradient descent · Stochastic optimization · Variance reduction · Strongly convex · Smooth and non-smooth · SGD · Large-scale learning

Acknowledgements

The first author is thankful to the Ministry of Human Resource Development, Government of India, for providing a fellowship (University Grants Commission - Senior Research Fellowship) to pursue his PhD.

References

  1. Allen-Zhu Z (2017) Katyusha: the first direct acceleration of stochastic gradient methods. Journal of Machine Learning Research (to appear). Full version available at arXiv:1603.05953
  2. Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Lechevallier Y, Saporta G (eds) Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010). Springer, Paris, pp 177–187. http://leon.bottou.org/papers/bottou-2010
  3. Bottou L, Bousquet O (2007) The tradeoffs of large scale learning. In: Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS'07. Curran Associates Inc., USA, pp 161–168. http://dl.acm.org/citation.cfm?id=2981562.2981583
  4. Bottou L, Curtis FE, Nocedal J (2016) Optimization methods for large-scale machine learning. arXiv:1606.04838
  5. Cauchy AL (1847) Méthode générale pour la résolution des systèmes d'équations simultanées. Compte Rendu des Séances de l'Académie des Sciences, Série A 25:536–538
  6. Chauhan VK, Dahiya K, Sharma A (2017) Mini-batch block-coordinate based stochastic average adjusted gradient methods to solve big data problems. In: Proceedings of the Ninth Asian Conference on Machine Learning, PMLR, vol 77, pp 49–64. http://proceedings.mlr.press/v77/chauhan17a.html
  7. Chauhan VK, Dahiya K, Sharma A (2018a) Problem formulations and solvers in linear SVM: a review. Artificial Intelligence Review. https://doi.org/10.1007/s10462-018-9614-6
  8. Chauhan VK, Sharma A, Dahiya K (2018b) Faster learning by reduction of data access time. Appl Intell 48(12):4715–4729. https://doi.org/10.1007/s10489-018-1235-x
  9. Chauhan VK, Sharma A, Dahiya K (2018c) Stochastic trust region inexact Newton method for large-scale machine learning. arXiv:1812.10426
  10. Csiba D, Richtárik P (2016) Importance sampling for minibatches, pp 1–19. arXiv:1602.02283v1
  11. Defazio A, Bach F, Lacoste-Julien S (2014) SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS'14. MIT Press, Cambridge, pp 1646–1654
  12. Shang F, Zhou K, Cheng J, Tsang IW, Zhang L, Tao D (2018) VR-SGD: a simple stochastic variance reduction method for machine learning. arXiv:1802.09932
  13. Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (eds) Advances in Neural Information Processing Systems, vol 26. Curran Associates, Inc., pp 315–323
  14. Kiefer J, Wolfowitz J (1952) Stochastic estimation of the maximum of a regression function. Ann Math Stat 23:462–466
  15. Konečný J, Richtárik P (2013) Semi-stochastic gradient descent methods. arXiv:1312.1666
  16. Konečný J, Liu J, Richtárik P, Takáč M (2016) Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J Sel Top Signal Process 10(2):242–255
  17. Lan G (2012) An optimal method for stochastic composite optimization. Math Program 133(1):365–397
  18. Le Roux N, Schmidt M, Bach F (2012) A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical Report, INRIA
  19. Lin H, Mairal J, Harchaoui Z (2015) A universal catalyst for first-order optimization. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS'15, pp 3384–3392
  20. Parikh N, Boyd S (2014) Proximal algorithms. Found Trends Optim 1(3):127–239
  21. Rakhlin A, Shamir O, Sridharan K (2012) Making gradient descent optimal for strongly convex stochastic optimization. In: Proceedings of the 29th International Conference on Machine Learning, ICML'12, pp 1571–1578
  22. Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22(3):400–407. http://www.jstor.org/stable/2236626
  23. Shalev-Shwartz S, Zhang T (2013) Stochastic dual coordinate ascent methods for regularized loss. J Mach Learn Res 14(1):567–599
  24. Shalev-Shwartz S, Singer Y, Srebro N (2007) Pegasos: primal estimated sub-gradient solver for SVM. In: Proceedings of the 24th International Conference on Machine Learning, ICML'07. ACM, New York, pp 807–814
  25. Wang H, Banerjee A (2014) Randomized block coordinate descent for online and stochastic optimization, pp 1–19. arXiv:1407.0107
  26. Wright SJ (2015) Coordinate descent algorithms. Math Program 151(1):3–34
  27. Xiao L, Zhang T (2014) A proximal stochastic gradient method with progressive variance reduction. SIAM J Optim 24(4):2057–2075
  28. Xu Y, Yin W (2015) Block stochastic gradient iteration for convex and nonconvex optimization. SIAM J Optim 25(3):1686–1716
  29. Yang Y, Chen J, Zhu J (2016) Distributing the stochastic gradient sampler for large-scale LDA. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'16. ACM, New York, pp 1975–1984
  30. Yang Z, Wang C, Zhang Z, Li J (2018) Random Barzilai–Borwein step size for mini-batch algorithms. Eng Appl Artif Intell 72:124–135
  31. Shen Z, Qian H, Mu T, Zhang C (2017) Accelerated doubly stochastic gradient algorithm for large-scale empirical risk minimization. In: IJCAI
  32. Zhang T (2004) Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-first International Conference on Machine Learning, ICML'04. ACM, New York, pp 116–
  33. Zhang Y, Xiao L (2015) Stochastic primal-dual coordinate method for regularized empirical risk minimization. In: Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pp 353–361
  34. Zhao T, Yu M, Wang Y, Arora R, Liu H (2014) Accelerated mini-batch randomized block coordinate descent method. In: Advances in Neural Information Processing Systems, pp 3329–3337
  35. Zhou L, Pan S, Wang J, Vasilakos AV (2017) Machine learning on big data: opportunities and challenges. Neurocomputing 237:350–361. https://doi.org/10.1016/j.neucom.2017.01.026

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Computer Science and Applications, Panjab University, Chandigarh, India
  2. University Institute of Engineering and Technology, Panjab University, Chandigarh, India
