Journal of the Indian Institute of Science, Volume 99, Issue 2, pp 201–213

Stochastic Gradient Descent and Its Variants in Machine Learning

  • Praneeth Netrapalli
Review Article


Stochastic gradient descent (SGD) is a fundamental algorithm that has had a profound impact on machine learning. This article surveys some important results on SGD and its variants that have arisen in machine learning.


Keywords: stochastic optimization, gradient descent, large-scale optimization
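To fix notation for readers new to the method, the basic SGD update is w_{t+1} = w_t − η_t ∇f(w_t; x_t), where x_t is a single random sample and η_t is a step size. A minimal sketch on a toy least-squares objective (the function names and toy data below are illustrative, not from the article):

```python
def sgd(grad_fn, w, samples, lr_schedule, epochs=1):
    """Plain SGD: one stochastic-gradient step per sample,
    with a per-step learning rate lr_schedule(t)."""
    t = 0
    for _ in range(epochs):
        for x in samples:
            w = w - lr_schedule(t) * grad_fn(w, x)
            t += 1
    return w

# Toy problem: minimize f(w) = E[(1/2)(w - x)^2]; the gradient on one
# sample is (w - x), and the minimizer is the sample mean (here 2.5).
samples = [1.0, 2.0, 3.0, 4.0]
# With Robbins-Monro step sizes eta_t = 1/(t+1), each update makes w
# the running average of the stream, so w converges to the mean.
w = sgd(lambda w, x: w - x, 0.0, samples, lambda t: 1.0 / (t + 1), epochs=50)
```

The step-size schedule η_t = 1/(t+1) is the classical Robbins–Monro choice (satisfying Ση_t = ∞, Ση_t² < ∞); with a constant step size the iterates would instead fluctuate in a neighborhood of the minimizer.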




Copyright information

© Indian Institute of Science 2019

Authors and Affiliations

  1. Microsoft Research, Bengaluru, India
