Abstract
Stochastic gradient descent (SGD) is a fundamental algorithm that has had a profound impact on machine learning. This article surveys some important results on SGD and its variants that have arisen in machine learning.
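For readers unfamiliar with the algorithm, below is a minimal illustrative sketch of the basic SGD update w_{t+1} = w_t - eta * grad f_i(w_t), applied here to a least-squares problem. The problem setup and function names are illustrative assumptions for this sketch, not code or notation taken from the article.

```python
import numpy as np

def sgd(grad_fn, w0, data, lr=0.01, epochs=10, seed=0):
    """Basic SGD: each update uses the gradient at one randomly chosen sample."""
    rng = np.random.default_rng(seed)
    w = w0.astype(float)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):  # one pass over the data in random order
            x, y = data[i]
            w -= lr * grad_fn(w, x, y)  # w_{t+1} = w_t - eta * grad f_i(w_t)
    return w

# Illustrative problem: least squares, f_i(w) = 0.5 * (<w, x_i> - y_i)^2,
# whose per-sample gradient is (<w, x_i> - y_i) * x_i.
def ls_grad(w, x, y):
    return (w @ x - y) * x

rng = np.random.default_rng(1)
w_true = np.array([2.0, -3.0])
X = rng.normal(size=(200, 2))
data = [(x, float(x @ w_true)) for x in X]
print(sgd(ls_grad, np.zeros(2), data, lr=0.05, epochs=20))  # approaches w_true
```

The key property on display is that each update touches a single sample, so the per-iteration cost is independent of the dataset size; the tradeoffs this introduces (variance, step-size schedules, averaging) are the subject of the results the article surveys.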
Additional information
This article belongs to the Special Issue on Recent Advances in Machine Learning.
About this article
Cite this article
Netrapalli, P. Stochastic Gradient Descent and Its Variants in Machine Learning. J Indian Inst Sci 99, 201–213 (2019). https://doi.org/10.1007/s41745-019-0098-4