Abstract
Stochastic gradient descent (SGD) is a fundamental algorithm that has had a profound impact on machine learning. This article surveys some important results on SGD and its variants that have arisen in machine learning.
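For readers unfamiliar with the algorithm, below is a minimal illustrative sketch of the basic SGD update w_{t+1} = w_t - eta * grad f_i(w_t), applied here to a least-squares problem. The problem setup and function names are illustrative assumptions for this sketch, not code or notation taken from the article.

```python
import numpy as np

def sgd(grad_fn, w0, data, lr=0.01, epochs=10, seed=0):
    """Basic SGD: each update uses the gradient at one randomly chosen sample."""
    rng = np.random.default_rng(seed)
    w = w0.astype(float)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):  # one pass over the data in random order
            x, y = data[i]
            w -= lr * grad_fn(w, x, y)  # w_{t+1} = w_t - eta * grad f_i(w_t)
    return w

# Illustrative problem: least squares, f_i(w) = 0.5 * (<w, x_i> - y_i)^2,
# whose per-sample gradient is (<w, x_i> - y_i) * x_i.
def ls_grad(w, x, y):
    return (w @ x - y) * x

rng = np.random.default_rng(1)
w_true = np.array([2.0, -3.0])
X = rng.normal(size=(200, 2))
data = [(x, float(x @ w_true)) for x in X]
print(sgd(ls_grad, np.zeros(2), data, lr=0.05, epochs=20))  # approaches w_true
```

The key property on display is that each update touches a single sample, so the per-iteration cost is independent of the dataset size; the tradeoffs this introduces (variance, step-size schedules, averaging) are the subject of the results the article surveys.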
Additional information
This article belongs to the Special Issue on Recent Advances in Machine Learning.
About this article
Cite this article
Netrapalli, P. Stochastic Gradient Descent and Its Variants in Machine Learning. J Indian Inst Sci 99, 201–213 (2019). https://doi.org/10.1007/s41745-019-0098-4