Stochastic Gradient Descent and Its Variants in Machine Learning

  • Review Article
  • Published:
Journal of the Indian Institute of Science

Abstract

Stochastic gradient descent (SGD) is a fundamental algorithm that has had a profound impact on machine learning. This article surveys some of the important results on SGD and its variants that have arisen in machine learning.
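
As a quick illustration of the algorithm the survey covers, the following is a minimal sketch of the basic SGD update on a least-squares objective. The synthetic data, step-size schedule, and variable names here are illustrative assumptions and are not taken from the article.

```python
# Minimal SGD sketch on least-squares regression (illustrative, not from the article).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y_i = <x_i, w_true> + noise
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)                      # initial iterate w_0
for t in range(1, 10_001):
    i = rng.integers(n)              # sample one example uniformly at random
    grad = (X[i] @ w - y[i]) * X[i]  # stochastic gradient of 0.5 * (<x_i, w> - y_i)^2
    eta = 1.0 / (10.0 + 0.1 * t)     # decaying step size of order O(1/t)
    w = w - eta * grad               # SGD update: w_{t+1} = w_t - eta_t * grad_t

print("distance to w_true:", np.linalg.norm(w - w_true))
```

With the decaying step size above, the iterates approach w_true; a constant step size would instead converge only to within a noise-dependent neighbourhood.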

Author information

Corresponding author

Correspondence to Praneeth Netrapalli.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Special Issue on Recent Advances in Machine Learning.

About this article

Cite this article

Netrapalli, P. Stochastic Gradient Descent and Its Variants in Machine Learning. J Indian Inst Sci 99, 201–213 (2019). https://doi.org/10.1007/s41745-019-0098-4
