Optimization for Deep Learning: An Overview


Optimization is a critical component of deep learning. We consider optimization for neural networks an interesting topic for theoretical research for two reasons. First, its tractability despite non-convexity is an intriguing question, and answering it may greatly expand our understanding of tractable problems. Second, classical optimization theory is far from sufficient to explain many observed phenomena. We therefore examine the challenges and opportunities from a theoretical perspective and review the existing research in this field. The review proceeds in three parts. First, we discuss the issue of gradient explosion/vanishing and the more general issue of an undesirable spectrum, and then discuss practical solutions including careful initialization, normalization methods and skip connections. Second, we review generic optimization methods used in training neural networks, such as stochastic gradient descent and adaptive gradient methods, together with existing theoretical results. Third, we review existing research on the global issues of neural network training, including results on the global landscape, mode connectivity, the lottery ticket hypothesis and the neural tangent kernel.



  1. While using GD to solve an optimization problem is straightforward, discovering BP was historically nontrivial.

  2. If the loss function is not quadratic but a general loss function \( \ell ( y, h^L ) \), we only need to replace \(e = 2(h^L - y)\) by \(e = \frac{\partial \ell }{ \partial h^L } \).

  3. In logic, the statement “every element of the set A belongs to the set B” does not imply that the set A is non-empty; if the set A is empty, then the statement always holds. For example, “every dragon on Earth is green” is a correct statement, since no dragon exists.

  4. This initialization is sometimes called LeCun initialization, but it first appeared on page 9 of Bottou [7], as pointed out by Bottou in private communication, so a more proper name may be “Bottou initialization.”

  5. Interestingly, ReLU was also popularized by Glorot et al. [12], but they did not apply their own principle to the new ReLU neuron.

  6. Note that this difficulty is probably not due to gradient explosion/vanishing and is perhaps related to singularities [19].

  7. Note that these two papers also use a certain scalar normalization trick that is much simpler than BatchNorm.

  8. Rigorously speaking, the conditions are stronger than realizability (e.g., the weak growth condition in [66]). For certain problems such as least squares, realizability is enough, since it implies the weak growth condition in [66].

  9. The paper that proposed Adam [85] achieved phenomenal success, at least in terms of popularity. It was posted on arXiv in December 2014; by Aug 2019, it had 26,000 citations on Google Scholar; by Dec 2019, the number was 33,000. Of course, a contribution to the optimization area cannot be judged by the number of citations alone, but the attention Adam received is still quite remarkable.

  10. It is not clear what we should call this subarea. Many researchers use “(provable) non-convex optimization” to distinguish this research from convex optimization. However, this name may be confused with studies of non-convex optimization that focus on convergence to stationary points. The name “global optimization” might be confused with research on heuristic methods, while GON is mainly theoretical. In any case, let us call it global optimization of neural-nets (GON) in this article.

  11. Again, it is not clear what to call this subarea. “Non-convex optimization” might be a bit confusing to optimizers.

  12. There are some recent pruned networks that can be trained from a random initial point [145, 146], but the sparsity level is not very high; see [147, Appendix A] for discussions.

  13. In this section, we will use “2-layer network” to denote a network like \( y = \phi ( W x + b ) \) or \(y = V^* \phi (W x + b)\) with fixed \(V^*\), and use “1-hidden-layer network” to denote a network like \(y = V \phi (W x + b_1 ) + b_2 \) with both V and W being variables.

  14. For batch GD, one epoch is one iteration. For SGD, one epoch consists of multiple stochastic gradient steps that together pass over all data points once. We do not say “iteration complexity” or “the number of iterations” since the per-iteration costs of vanilla gradient descent and SGD are different, which can easily cause confusion. In contrast, the per-epoch cost (number of operations) of batch GD and SGD is comparable.

  15. Note that the methods discussed below also improve the rate for convex problems, but we skip the discussion.

  16. Results for general convex problems are also established, but we discuss a simple case for ease of presentation. Here, we assume the dimension and the number of samples are both n; a refined analysis can show the dependence on the two parameters.
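
The epoch convention in the note on batch GD and SGD above can be illustrated with a small sketch. The code below is not from the article: the one-dimensional least-squares data, the helper names `grad_i` and `run_one_epoch`, and the step size are all assumptions chosen for illustration. It counts the parameter updates and the per-sample gradient evaluations performed in one epoch.

```python
import random


def grad_i(w, x, y):
    """Gradient of the per-sample loss (w*x - y)^2 with respect to w."""
    return 2.0 * (w * x - y) * x


def run_one_epoch(w, data, batch_size, lr, rng):
    """Pass over all data points once; return (w, steps, grad_evals)."""
    order = list(range(len(data)))
    rng.shuffle(order)
    steps = 0
    grad_evals = 0
    for start in range(0, len(order), batch_size):
        batch = [data[i] for i in order[start:start + batch_size]]
        # Average the per-sample gradients over the current batch.
        g = sum(grad_i(w, x, y) for x, y in batch) / len(batch)
        grad_evals += len(batch)
        w -= lr * g
        steps += 1
    return w, steps, grad_evals


rng = random.Random(0)
# Hypothetical toy data: y = 3x with n = 10 samples.
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 11)]

# Batch GD: the batch is the whole data set, so one epoch is one iteration.
_, gd_steps, gd_evals = run_one_epoch(0.0, data, len(data), 0.1, rng)

# SGD with batch size 1: one epoch is n stochastic gradient steps.
_, sgd_steps, sgd_evals = run_one_epoch(0.0, data, 1, 0.1, rng)

print("batch GD:", gd_steps, "step(s),", gd_evals, "gradient evaluations")
print("SGD:     ", sgd_steps, "step(s),", sgd_evals, "gradient evaluations")
```

With n = 10 data points, batch GD performs one update per epoch and SGD with batch size 1 performs ten, yet both evaluate ten per-sample gradients. This is the sense in which the per-epoch costs are comparable while the per-iteration costs are not.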


  1. Bertsekas, D.P.: Nonlinear programming. J. Oper. Res. Soc. 48(3), 334–334 (1997)

  2. Sra, S., Nowozin, S., Wright, S.J.: Optimization for Machine Learning. MIT Press, Cambridge (2012)

  3. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  4. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning, vol. 1. MIT Press, Cambridge (2016)

  5. Jakubovitz, D., Giryes, R., Rodrigues, M.R.D.: Generalization error in deep learning. In: Boche, H., Caire, G., Calderbank, R., Kutyniok, G., Mathar, R. (eds.) Compressed Sensing and Its Applications, pp. 153–193. Springer, Berlin (2019)

  6. Shamir, O.: Exponential convergence time of gradient descent for one-dimensional deep linear neural networks (2018). arXiv:1809.08587

  7. Bottou, L.: Reconnaissance de la parole par reseaux connexionnistes. In: Proceedings of Neuro Nimes 88, pp. 197–218. Nimes, France (1988). http://leon.bottou.org/papers/bottou-88b

  8. LeCun, Y., Bottou, L., Orr, G.B., Müller, K.-R.: Efficient backprop. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade, pp. 9–50. Springer, Berlin (1998)

  9. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)

  10. Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., Bengio, S.: Why does unsupervised pre-training help deep learning? J. Mach. Learn. Res. 11(Feb), 625–660 (2010)

  11. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010)

  12. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323 (2011)

  13. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)

  14. Mishkin, D., Matas, J.: All you need is a good init (2015). arXiv:1511.06422

  15. Saxe, A.M., McClelland, J.L., Ganguli, S.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks (2013). arXiv:1312.6120

  16. Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., Ganguli, S.: Exponential expressivity in deep neural networks through transient chaos. In: Advances in Neural Information Processing Systems, pp. 3360–3368 (2016)

  17. Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: convergence and generalization in neural networks. In: Advances in Neural Information Processing Systems, pp. 8571–8580 (2018)

  18. Hanin, B., Rolnick, D.: How to start training: the effect of initialization and architecture. In: Advances in Neural Information Processing Systems, pp. 569–579 (2018)

  19. Orhan, A.E., Pitkow, X.: Skip connections eliminate singularities (2017). arXiv:1701.09175

  20. Pennington, J., Schoenholz, S., Ganguli, S.: Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In: Advances in Neural Information Processing Systems, pp. 4785–4795 (2017)

  21. Pennington, J., Schoenholz, S.S., Ganguli, S.: The emergence of spectral universality in deep networks (2018). arXiv:1802.09979

  22. Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S.S., Pennington, J.: Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks (2018). arXiv:1806.05393

  23. Li, P., Nguyen, P.-M.: On random deep weight-tied autoencoders: exact asymptotic analysis, phase transitions, and implications to training. In: 7th International Conference on Learning Representations, ICLR 2019 (2019). https://openreview.net/forum?id=HJx54i05tX

  24. Gilboa, D., Chang, B., Chen, M., Yang, G., Schoenholz, S.S., Chi, E.H., Pennington, J.: Dynamical isometry and a mean field theory of LSTMs and GRUs (2019). arXiv:1901.08987

  25. Dauphin, Y.N., Schoenholz, S.: Metainit: initializing learning by learning to initialize. In: Advances in Neural Information Processing Systems, pp. 12624–12636 (2019)

  26. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift (2015). arXiv:1502.03167

  27. Santurkar, S., Tsipras, D., Ilyas, A., Madry, A.: How does batch normalization help optimization? In: Advances in Neural Information Processing Systems, pp. 2483–2493 (2018)

  28. Bjorck, N., Gomes, C.P., Selman, B., Weinberger, K.Q.: Understanding batch normalization. In: Advances in Neural Information Processing Systems, pp. 7694–7705 (2018)

  29. Arora, S., Li, Z., Lyu, K.: Theoretical analysis of auto rate-tuning by batch normalization. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=rkxQ-nA9FX

  30. Cai, Y., Li, Q., Shen, Z.: A quantitative analysis of the effect of batch normalization on gradient descent. In: International Conference on Machine Learning, pp. 882–890 (2019)

  31. Kohler, J., Daneshmand, H., Lucchi, A., Hofmann, T., Zhou, M., Neymeyr, K.: Exponential convergence rates for batch normalization: The power of length-direction decoupling in non-convex optimization. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 806–815 (2019)

  32. Ghorbani, B., Krishnan, S., Xiao, Y.: An investigation into neural net optimization via hessian eigenvalue density (2019). arXiv:1901.10159

  33. Salimans, T., Kingma, D.P.: Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In: Advances in Neural Information Processing Systems, pp. 901–909 (2016)

  34. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization (2016). arXiv:1607.06450

  35. Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: the missing ingredient for fast stylization (2016). arXiv:1607.08022

  36. Wu, Y., He, K.: Group normalization. In: Proceedings of the European Conference on Computer Vision, pp. 3–19 (2018)

  37. Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks (2018). arXiv:1802.05957

  38. Luo, P., Zhang, R., Ren, J., Peng, Z., Li, J.: Switchable normalization for learning-to-normalize deep representation. IEEE Trans. Pattern Anal. Mach. Intell. (2019)

  39. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

  40. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)

  41. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  42. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556

  43. Srivastava, R.K., Greff, K., Schmidhuber, J.: Highway networks (2015). arXiv:1505.00387

  44. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)

  45. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500 (2017)

  46. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning (2016). arXiv:1611.01578

  47. Yu, J., Huang, T.: Network slimming by slimmable networks: towards one-shot architecture search for channel numbers (2019). arXiv:1903.11728

  48. Tan, M., Le, Q.V.: Efficientnet: rethinking model scaling for convolutional neural networks (2019). arXiv:1905.11946

  49. Hanin, B.: Which neural net architectures give rise to exploding and vanishing gradients? In: Advances in Neural Information Processing Systems, pp. 580–589 (2018)

  50. Tarnowski, W., Warchoł, P., Jastrzębski, S., Tabor, J., Nowak, M.: Dynamical isometry is achieved in residual networks in a universal way for any activation function. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2221–2230 (2019)

  51. Yang, G., Schoenholz, S.: Mean field residual networks: On the edge of chaos. In: Advances in Neural Information Processing Systems, pp. 7103–7114 (2017)

  52. Balduzzi, D., Frean, M., Leary, L., Lewis, J.P., Ma, K.W.-D., McWilliams, B.: The shattered gradients problem: If resnets are the answer, then what is the question? In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 342–350. JMLR.org (2017)

  53. Zhang, H., Dauphin, Y.N., Ma, T.: Fixup initialization: residual learning without normalization (2019). arXiv:1901.09321

  54. Curtis, F.E., Scheinberg, K.: Optimization methods for supervised machine learning: from linear models to deep learning. In: Leading Developments from INFORMS Communities, pp. 89–114. INFORMS (2017)

  55. Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch sgd: Training imagenet in 1 hour (2017). arXiv:1706.02677

  56. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

  57. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding (2018). arXiv:1810.04805

  58. Gotmare, A., Keskar, N.S., Xiong, C., Socher, R.: A closer look at deep learning heuristics: learning rate restarts, warmup and distillation. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=r14EOsCqKX

  59. Smith, L.N.: Cyclical learning rates for training neural networks. In: 2017 IEEE Winter Conference on Applications of Computer Vision, pp. 464–472. IEEE (2017)

  60. Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts (2016). arXiv:1608.03983

  61. Smith, L.N., Topin, N.: Super-convergence: very fast training of neural networks using large learning rates (2017). arXiv:1708.07120

  62. Powell, M.J.D.: Restart procedures for the conjugate gradient method. Math. Program. 12(1), 241–254 (1977)

  63. O’Donoghue, B., Candes, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015)

  64. Luo, Z.-Q.: On the convergence of the lms algorithm with adaptive learning rate for linear feedforward networks. Neural Comput. 3(2), 226–245 (1991)

  65. Schmidt, M., Le Roux, N.: Fast convergence of stochastic gradient descent under a strong growth condition (2013). arXiv:1308.6370

  66. Vaswani, S., Bach, F., Schmidt, M.: Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron (2018). arXiv:1810.07288

  67. Liu, C., Belkin, M.: Mass: an accelerated stochastic method for over-parametrized learning (2018). arXiv:1810.13395

  68. Bottou, L.: Online learning and stochastic approximations. On-line Learn. Neural Netw. 17(9), 142 (1998)

  69. Ruder, S.: An overview of gradient descent optimization algorithms (2016). arXiv:1609.04747

  70. Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1–2), 37–75 (2014)

  71. Devolder, O., Glineur, F., Nesterov, Y., et al.: First-order methods with inexact oracle: the strongly convex case. No. 2013016. Université catholique de Louvain, Center for Operations Research and Econometrics (CORE) (2013)

  72. Kidambi, R., Netrapalli, P., Jain, P., Kakade, S.: On the insufficiency of existing momentum schemes for stochastic optimization. In: 2018 Information Theory and Applications Workshop (ITA), pp. 1–9. IEEE (2018)

  73. Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. In: Advances in Neural Information Processing Systems, pp. 3384–3392 (2015)

  74. Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. J. Mach. Learn. Res. 18(1), 8194–8244 (2017)

  75. Defazio, A., Bottou, L.: On the ineffectiveness of variance reduced optimization for deep learning. In: Advances in Neural Information Processing Systems, pp. 1753–1763 (2019)

  76. Jain, P., Kakade, S.M., Kidambi, R., Netrapalli, P., Sidford, A.: Accelerating stochastic gradient descent (2017). arXiv:1704.08227

  77. Liu, C., Belkin, M.: Accelerating sgd with momentum for over-parameterized learning (2018). arXiv:1810.13395

  78. Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for nonconvex optimization. SIAM J. Optim. 28(2), 1751–1772 (2018)

  79. Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Convex until proven guilty: dimension-free acceleration of gradient descent on non-convex functions. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 654–663 (2017)

  80. Xu, Y., Jin, R., Yang, T.: First-order stochastic algorithms for escaping from saddle points in almost linear time. In: Advances in Neural Information Processing Systems, pp. 5535–5545 (2018)

  81. Fang, C., Li, C.J., Lin, Z., Zhang, T.: Spider: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In: Advances in Neural Information Processing Systems, pp. 687–697 (2018)

  82. Allen-Zhu, Z.: Natasha 2: faster non-convex optimization than sgd. In: Advances in Neural Information Processing Systems, pp. 2680–2691 (2018)

  83. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(Jul), 2121–2159 (2011)

  84. Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn. 4(2), 26–31 (2012)

  85. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014). arXiv:1412.6980

  86. Zeiler, M.D.: Adadelta: an adaptive learning rate method (2012). arXiv:1212.5701

  87. Dozat, T.: Incorporating nesterov momentum into adam. In: International Conference on Learning Representations, Workshop (ICLRW), pp. 1–6 (2016)

  88. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013). arXiv:1301.3781

  89. Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543 (2014)

  90. Wilson, A.C., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. In: Advances in Neural Information Processing Systems, pp. 4148–4158 (2017)

  91. Keskar, N.S., Socher, R.: Improving generalization performance by switching from adam to sgd (2017). arXiv:1712.07628

  92. Sivaprasad, P.T., Mai, F., Vogels, T., Jaggi, M., Fleuret, F.: On the tunability of optimizers in deep learning (2019). arXiv:1910.11758

  93. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of adam and beyond. In: International Conference on Learning Representations (2018)

  94. Chen, X., Liu, S., Sun, R., Hong, M.: On the convergence of a class of adam-type algorithms for non-convex optimization (2018). arXiv:1808.02941

  95. Zhou, D., Tang, Y., Yang, Z., Cao, Y., Gu, Q.: On the convergence of adaptive gradient methods for nonconvex optimization (2018). arXiv:1808.05671

  96. Zou, F., Shen, L.: On the convergence of adagrad with momentum for training deep neural networks (2018). arXiv:1808.03408

  97. De, S., Mukherjee, A., Ullah, E.: Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration (2018). arXiv:1807.06766

  98. Zou, F., Shen, L., Jie, Z., Zhang, W., Liu, W.: A sufficient condition for convergences of adam and rmsprop (2018). arXiv:1811.09358

  99. Ward, R., Wu, X., Bottou, L.: Adagrad stepsizes: sharp convergence over nonconvex landscapes, from any initialization (2018). arXiv:1806.01811

  100. Barakat, A., Bianchi, P.: Convergence analysis of a momentum algorithm with adaptive step size for non convex optimization (2019). arXiv:1911.07596

  101. Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods, vol. 23. Prentice Hall, Englewood Cliffs (1989)

  102. Smith, S.L., Kindermans, P.-J., Le, Q.V.: Don’t decay the learning rate, increase the batch size. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=B1Yy1BxCZ

  103. Akiba, T., Suzuki, S., Fukuda, K.: Extremely large minibatch sgd: training resnet-50 on imagenet in 15 minutes (2017). arXiv:1711.04325

  104. Jia, X., Song, S., He, W., Wang, Y., Rong, H., Zhou, F., Xie, L., Guo, Z., Yang, Y., Yu, L., et al.: Highly scalable deep learning training system with mixed-precision: training imagenet in four minutes (2018). arXiv:1807.11205

  105. Mikami, H., Suganuma, H., Tanaka, Y., Kageyama, Y., et al.: Massively distributed sgd: Imagenet/resnet-50 training in a flash (2018). arXiv:1811.05233

  106. Ying, C., Kumar, S., Chen, D., Wang, T., Cheng, Y.: Image classification at supercomputer scale (2018). arXiv:1811.06992

  107. Yamazaki, M., Kasagi, A., Tabuchi, A., Honda, T., Miwa, M., Fukumoto, N., Tabaru, T., Ike, A., Nakashima, K.: Yet another accelerated sgd: Resnet-50 training on imagenet in 74.7 seconds (2019). arXiv:1903.12650

  108. You, Y., Zhang, Z., Hsieh, C.-J., Demmel, J., Keutzer, K.: Imagenet training in minutes. In: Proceedings of the 47th International Conference on Parallel Processing, p. 1. ACM (2018)

  109. Yuan, Y.: Step-sizes for the gradient method. AMS IP Stud. Adv. Math. 42(2), 785 (2008)

  110. Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)

  111. Becker, S., Le Cun, Y., et al.: Improving the convergence of back-propagation learning with second order methods. In: Proceedings of the 1988 Connectionist Models Summer School, pp. 29–37 (1988)

  112. Bordes, A., Bottou, L., Gallinari, P.: Sgd-qn: Careful quasi-newton stochastic gradient descent. J. Mach. Learn. Res. 10(Jul), 1737–1754 (2009)

  113. LeCun, Y.A., Bottou, L., Orr, G.B., Müller, K.-R.: Efficient backprop. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade, pp. 9–48. Springer, Berlin (2012)

  114. Schaul, T., Zhang, S., LeCun, Y.: No more pesky learning rates. In: International Conference on Machine Learning, pp. 343–351 (2013)

  115. Tan, C., Ma, S., Dai, Y.-H., Qian, Y.: Barzilai-borwein step size for stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 685–693 (2016)

  116. Orabona, F., Tommasi, T.: Training deep networks without learning rates through coin betting. In: Advances in Neural Information Processing Systems, pp. 2160–2170 (2017)

  117. Martens, J.: Deep learning via hessian-free optimization. ICML 27, 735–742 (2010)

  118. Pearlmutter, B.A.: Fast exact multiplication by the hessian. Neural Comput. 6(1), 147–160 (1994)

  119. Schraudolph, N.N.: Fast curvature matrix–vector products for second-order gradient descent. Neural Comput. 14(7), 1723–1738 (2002)

  120. Berahas, A.S., Jahani, M., Takáč, M.: Quasi-newton methods for deep learning: forget the past, just sample (2019). arXiv:1901.09997

  121. Amari, S.-I., Park, H., Fukumizu, K.: Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Comput. 12(6), 1399–1409 (2000)

  122. Martens, J.: New insights and perspectives on the natural gradient method (2014). arXiv:1412.1193

  123. Amari, S., Nagaoka, H.: Methods of Information Geometry, vol. 191. American Mathematical Society, Providence (2007)

  124. Martens, J., Grosse, R.: Optimizing neural networks with kronecker-factored approximate curvature. In: International Conference on Machine Learning, pp. 2408–2417 (2015)

  125. Osawa, K., Tsuji, Y., Ueno, Y., Naruse, A., Yokota, R., Matsuoka, S.: Second-order optimization method for large mini-batch: training resnet-50 on imagenet in 35 epochs (2018). arXiv:1811.12019

  126. Anil, R., Gupta, V., Koren, T., Regan, K., Singer, Y.: Second order optimization made practical (2020). arXiv:2002.09018

  127. Gupta, V., Koren, T., Singer, Y.: Shampoo: preconditioned stochastic tensor optimization (2018). arXiv:1802.09568

  128. Vidal, R., Bruna, J., Giryes, R., Soatto, S.: Mathematics of deep learning (2017). arXiv:1712.04741

  129. Lu, C., Deng, Z., Zhou, J., Guo, X.: A sensitive-eigenvector based global algorithm for quadratically constrained quadratic programming. J. Glob. Optim. 73, 1–18 (2019)

  130. Ferreira, O.P., Németh, S.Z.: On the spherical convexity of quadratic functions. J. Glob. Optim. 73(3), 537–545 (2019). https://doi.org/10.1007/s10898-018-0710-6

  131. Chi, Y., Lu, Y.M., Chen, Y.: Nonconvex optimization meets low-rank matrix factorization: an overview. IEEE Trans. Signal Process. 67(20), 5239–5269 (2019)

  132. Dauphin, Y.N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., Bengio, Y.: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In: Advances in Neural Information Processing Systems, pp. 2933–2941 (2014)

  133. Goodfellow, I.J., Vinyals, O., Saxe, A.M.: Qualitatively characterizing neural network optimization problems (2014). arXiv:1412.6544

  134. Poggio, T., Liao, Q.: Theory II: landscape of the empirical risk in deep learning. PhD thesis, Center for Brains, Minds and Machines (CBMM) (2017). arXiv:1703.09833

  135. Li, H., Xu, Z., Taylor, G., Studer, C., Goldstein, T.: Visualizing the loss landscape of neural nets. In: Advances in Neural Information Processing Systems, pp. 6391–6401 (2018)

  136. Baity-Jesi, M., Sagun, L., Geiger, M., Spigler, S., Arous, G.B., Cammarota, C., LeCun, Y., Wyart, M., Biroli, G.: Comparing dynamics: deep neural networks versus glassy systems (2018). arXiv:1803.06969

  137. Franz, S., Hwang, S., Urbani, P.: Jamming in multilayer supervised learning models (2018). arXiv:1809.09945

  138. Geiger, M., Spigler, S., d’Ascoli, S., Sagun, L., Baity-Jesi, M., Biroli, G., Wyart, M.: The jamming transition as a paradigm to understand the loss landscape of deep neural networks (2018). arXiv:1809.09349

  139. Draxler, F., Veschgini, K., Salmhofer, M., Hamprecht, F.A.: Essentially no barriers in neural network energy landscape (2018). arXiv:1803.00885

  140. Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D.P., Wilson, A.G.: Loss surfaces, mode connectivity, and fast ensembling of DNNs. In: Advances in Neural Information Processing Systems, pp. 8789–8798 (2018)

  141. 141.

    Freeman, C.D., Bruna, J.: Topology and geometry of half-rectified network optimization (2016). arXiv:1611.01540

  142. 142.

    Nguyen, Q.: On connected sublevel sets in deep learning (2019b). arXiv:1901.07417

  143. 143.

    Kuditipudi, R., Wang, X., Lee, H., Zhang, Y., Li, Z., Hu, W., Arora, S., Ge, R.: Explaining landscape connectivity of low-cost solutions for multilayer nets (2019). arXiv:1906.06247

  144. 144.

    Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding (2015). arXiv:1510.00149

  145. 145.

    Liu, Z., Sun, M., Zhou, T., Huang, G., Darrell, T.: Rethinking the value of network pruning (2018). arXiv:1810.05270

  146. 146.

    Lee, N., Ajanthan, T., Torr, P.: SNIP: single-shot network pruning based on connection sensitivity. In: International Conference on Learning Representations (2019b). https://openreview.net/forum?id=B1VZqjAcYX

  147. 147.

    Frankle, J., Dziugaite, G.K., Roy, D.M., Carbin, M.: The lottery ticket hypothesis at scale (2019). arXiv:1903.01611

  148. 148.

    Frankle, J., Carbin, M.: The lottery ticket hypothesis: finding sparse, trainable neural networks (2018). arXiv:1803.03635

  149. 149.

    Zhou, H., Lan, J., Liu, R., Yosinski, J.: Deconstructing lottery tickets: zeros, signs, and the supermask (2019). arXiv:1905.01067

  150. 150.

    Morcos, A.S., Yu, H., Paganini, M., Tian, Y.: One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers (2019). arXiv:1906.02773

  151. 151.

    Tian, Y., Jiang, T., Gong, Q., Morcos, A.: Luck matters: Understanding training dynamics of deep relu networks (2019). arXiv:1905.13405

  152. 152.

    Hochreiter, S., Schmidhuber, J.: Flat minima. Neural Comput. 9(1), 1–42 (1997)


  153. 153.

    Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: generalization gap and sharp minima (2016). arXiv:1609.04836

  154. 154.

    Dinh, L., Pascanu, R., Bengio, S., Bengio, Y.: Sharp minima can generalize for deep nets. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 1019–1028 (2017)

  155. 155.

    Neyshabur, B., Salakhutdinov, R.R., Srebro, N.: Path-sgd: Path-normalized optimization in deep neural networks. In: Advances in Neural Information Processing Systems, pp. 2422–2430 (2015)

  156. 156.

    Yi, M., Meng, Q., Chen, W., Ma, Z., Liu, T.-Y.: Positively scale-invariant flatness of relu neural networks (2019). arXiv:1903.02237

  157. 157.

    He, H., Huang, G., Yuan, Y.: Asymmetric valleys: beyond sharp and flat local minima (2019). arXiv:1902.00744

  158. 158.

    Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., Zecchina, R.: Entropy-sgd: Biasing gradient descent into wide valleys (2016). arXiv:1611.01838

  159. 159.

    Kawaguchi, K.: Deep learning without poor local minima. In: Advances in Neural Information Processing Systems, pp. 586–594 (2016)

  160. 160.

    Lu, H., Kawaguchi, K.: Depth creates no bad local minima (2017). arXiv:1702.08580

  161. 161.

    Laurent, T., Brecht, J.: Deep linear networks with arbitrary loss: all local minima are global. In: International Conference on Machine Learning, pp. 2908–2913 (2018)

  162. 162.

    Nouiehed, M., Razaviyayn, M.: Learning deep models: critical points and local openness (2018). arXiv:1803.02968

  163. 163.

    Zhang, L.: Depth creates no more spurious local minima (2019). arXiv:1901.09827

  164. 164.

    Yun, C., Sra, S., Jadbabaie, A.: Global optimality conditions for deep neural networks (2017). arXiv:1707.02444

  165. 165.

    Zhou, Y., Liang, Y.: Critical points of linear neural networks: analytical forms and landscape properties (2018). arXiv:1710.11205

  166. 166.

    Livni, R., Shalev-Shwartz, S., Shamir, O.: On the computational efficiency of training neural networks. In: Advances in Neural Information Processing Systems, pp. 855–863 (2014)

  167. 167.

    Neyshabur, B., Bhojanapalli, S., McAllester, D., Srebro, N.: Exploring generalization in deep learning. In: Advances in Neural Information Processing Systems, pp. 5947–5956 (2017)

  168. 168.

    Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization (2016). arXiv:1611.03530

  169. 169.

    Nguyen, Q., Mukkamala, M.C., Hein, M.: On the loss landscape of a class of deep neural networks with no bad local valleys (2018). arXiv:1809.10749

  170. 170.

    Li, D., Ding, T., Sun, R.: Over-parameterized deep neural networks have no strict local minima for any continuous activations (2018a). arXiv:1812.11039

  171. 171.

    Yu, X., Pasupathy, S.: Innovations-based MLSE for Rayleigh flat fading channels. IEEE Trans. Commun. 43, 1534–1544 (1995)


  172. 172.

    Ding, T., Li, D., Sun, R.: Sub-optimal local minima exist for almost all over-parameterized neural networks. Optimization Online (2019). arXiv:1911.01413

  173. 173.

    Bartlett, P.L., Foster, D.J., Telgarsky, M.J.: Spectrally-normalized margin bounds for neural networks. In: Advances in Neural Information Processing Systems, pp. 6240–6249 (2017)

  174. 174.

    Wei, C., Lee, J.D., Liu, Q., Ma, T.: On the margin theory of feedforward neural networks (2018). arXiv:1810.05369

  175. 175.

    Wu, L., Zhu, Z., et al.: Towards understanding generalization of deep learning: perspective of loss landscapes (2017). arXiv:1706.10239

  176. 176.

    Belkin, M., Hsu, D., Ma, S., Mandal, S.: Reconciling modern machine learning and the bias-variance trade-off (2018). arXiv:1812.11118

  177. 177.

    Mei, S., Montanari, A.: The generalization error of random features regression: precise asymptotics and double descent curve (2019). arXiv:1908.05355

  178. 178.

    Liang, S., Sun, R., Lee, J.D., Srikant, R.: Adding one neuron can eliminate all bad local minima. In: Advances in Neural Information Processing Systems, pp. 4355–4365 (2018a)

  179. 179.

    Kawaguchi, K., Kaelbling, L.P.: Elimination of all bad local minima in deep learning (2019). arXiv:1901.00279

  180. 180.

    Liang, S., Sun, R., Srikant, R.: Revisiting landscape analysis in deep neural networks: eliminating decreasing paths to infinity (2019). arXiv:1912.13472

  181. 181.

    Shalev-Shwartz, S., Shamir, O., Shammah, S.: Failures of gradient-based deep learning. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 3067–3075. JMLR. org (2017)

  182. 182.

    Swirszcz, G., Czarnecki, W.M., Pascanu, R.: Local minima in training of deep networks (2016). arXiv:1611.06310

  183. 183.

    Zhou, Y., Liang, Y.: Critical points of neural networks: analytical forms and landscape properties (2017). arXiv:1710.11205

  184. 184.

    Safran, I., Shamir, O.: Spurious local minima are common in two-layer relu neural networks (2017). arXiv:1712.08968

  185. 185.

    Venturi, L., Bandeira, A., Bruna, J.: Spurious valleys in two-layer neural network optimization landscapes (2018b). arXiv:1802.06384

  186. 186.

    Liang, S., Sun, R., Li, Y., Srikant, R.: Understanding the loss surface of neural networks for binary classification (2018b). arXiv:1803.00909

  187. 187.

    Yun, C., Sra, S., Jadbabaie, A.: Small nonlinearities in activation functions create bad local minima in neural networks (2018). arXiv:1802.03487

  188. 188.

    Bartlett, P., Helmbold, D., Long, P.: Gradient descent with identity initialization efficiently learns positive definite linear transformations. In: International Conference on Machine Learning, pp. 520–529 (2018)

  189. 189.

    Arora, S., Cohen, N., Golowich, N., Hu, W.: A convergence analysis of gradient descent for deep linear neural networks (2018). arXiv:1810.02281

  190. 190.

    Ji, Z., Telgarsky, M.: Gradient descent aligns the layers of deep linear networks (2018). arXiv:1810.02032

  191. 191.

    Du, S.S., Lee, J.D., Li, H., Wang, L., Zhai, X.: Gradient descent finds global minima of deep neural networks (2018). arXiv:1811.03804

  192. 192.

    Yang, G.: Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation (2019). arXiv:1902.04760

  193. 193.

    Novak, R., Xiao, L., Bahri, Y., Lee, J., Yang, G., Abolafia, D.A., Pennington, J., Sohl-dickstein, J.: Bayesian deep convolutional networks with many channels are gaussian processes. In: International Conference on Learning Representations (2019a). https://openreview.net/forum?id=B1g30j0qF7

  194. 194.

    Allen-Zhu, Z., Li, Y., Song, Z.: A convergence theory for deep learning via over-parameterization (2018). arXiv:1811.03962

  195. 195.

    Zou, D., Cao, Y., Zhou, D., Gu, Q.: Stochastic gradient descent optimizes over-parameterized deep relu networks (2018a). arXiv:1811.08888

  196. 196.

    Li, Y., Liang, Y.: Learning overparameterized neural networks via stochastic gradient descent on structured data. In: Advances in Neural Information Processing Systems, pp. 8168–8177 (2018)

  197. 197.

    Arora, S., Du, S.S., Hu, W., Li, Z., Salakhutdinov, R., Wang, R.: On exact computation with an infinitely wide neural net (2019a). arXiv:1904.11955

  198. 198.

    Zhang, H., Yu, D., Chen, W., Liu, T.-Y.: Training over-parameterized deep resnet is almost as easy as training a two-layer network (2019b). arXiv:1903.07120

  199. 199.

    Ma, C., Wu, L., et al.: Analysis of the gradient descent algorithm for a deep neural network model with skip-connections (2019). arXiv:1904.05263

  200. 200.

    Li, Z., Wang, R., Yu, D., Du, S.S., Hu, W., Salakhutdinov, R., Arora, S.: Enhanced convolutional neural tangent kernels (2019). arXiv:1806.05393

  201. 201.

    Arora, S., Du, S.S., Li, Z., Salakhutdinov, R., Wang, R., Yu, D.: Harnessing the power of infinitely wide deep nets on small-data tasks (2019b). arXiv:1910.01663

  202. 202.

    Novak, R., Xiao, L., Hron, J., Lee, J., Alemi, A.A., Sohl-Dickstein, J., Schoenholz, S.S.: Neural tangents: Fast and easy infinite neural networks in python (2019b). arXiv:1912.02803

  203. 203.

    Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., Pennington, J.: Wide neural networks of any depth evolve as linear models under gradient descent. In: Advances in Neural Information Processing Systems, pp. 8570–8581 (2019a)

  204. 204.

    Sirignano, J., Spiliopoulos, K.: Mean field analysis of deep neural networks (2019). arXiv:1903.04440

  205. 205.

    Araujo, D., Oliveira, R.I., Yukimura, D.: A mean-field limit for certain deep neural networks (2019). arXiv:1906.00193

  206. 206.

    Nguyen, P.-M.: Mean field limit of the learning dynamics of multilayer neural networks (2019a). arXiv:1902.02880

  207. 207.

    Mei, S., Montanari, A., Nguyen, P.-M.: A mean field view of the landscape of two-layers neural networks (2018). arXiv:1804.06561

  208. 208.

    Sirignano, J., Spiliopoulos, K.: Mean field analysis of neural networks (2018). arXiv:1805.01053

  209. 209.

    Rotskoff, G.M., Vanden-Eijnden, E.: Neural networks as interacting particle systems: asymptotic convexity of the loss landscape and universal scaling of the approximation error (2018). arXiv:1805.00915

  210. 210.

    Chizat, L., Oyallon, E., Bach, F.: On the global convergence of gradient descent for over-parameterized models using optimal transport. In: Advances in Neural Information Processing Systems, pp. 3040–3050 (2018)

  211. 211.

    Williams, F., Trager, M., Silva, C., Panozzo, D., Zorin, D., Bruna, J.: Gradient dynamics of shallow univariate relu networks. In: Advances in Neural Information Processing Systems, pp. 8376–8385 (2019)

  212. 212.

    Venturi, L., Bandeira, A., Bruna, J.: Neural networks with finite intrinsic dimension have no spurious valleys (2018a). arXiv:1802.06384

  213. 213.

    Haeffele, B.D., Vidal, R.: Global optimality in neural network training. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7331–7339 (2017)

  214. 214.

    Burer, S., Monteiro, R.D.C.: Local minima and convergence in low-rank semidefinite programming. Math. Program. 103(3), 427–444 (2005)


  215. 215.

    Ge, R., Lee, J.D., Ma, T.: Learning one-hidden-layer neural networks with landscape design (2017). arXiv:1711.00501

  216. 216.

    Gao, W., Makkuva, A.V., Oh, S., Viswanath, P.: Learning one-hidden-layer neural networks under general input distributions (2018). arXiv:1810.04133

  217. 217.

    Feizi, S., Javadi, H., Zhang, J., Tse, D.: Porcupine neural networks: (almost) all local optima are global (2017). arXiv:1710.02196

  218. 218.

    Panigrahy, R., Rahimi, A., Sachdeva, S., Zhang, Q.: Convergence results for neural networks via electrodynamics (2017). arXiv:1702.00458

  219. 219.

    Soltanolkotabi, M., Javanmard, A., Lee, J.D.: Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Trans. Inf. Theory 65(2), 742–769 (2019)


  220. 220.

    Soudry, D., Hoffer, E.: Exponentially vanishing sub-optimal local minima in multilayer neural networks (2017). arXiv:1702.05777

  221. 221.

    Laurent, T., von Brecht, J.: The multilinear structure of relu networks (2017). arXiv:1712.10132

  222. 222.

    Tian, Y.: An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 3404–3413. JMLR. org (2017)

  223. 223.

    Brutzkus, A., Globerson, A.: Globally optimal gradient descent for a convnet with Gaussian inputs. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 605–614 (2017)

  224. 224.

    Zhong, K., Song, Z., Jain, P., Bartlett, P.L., Dhillon, I.S.: Recovery guarantees for one-hidden-layer neural networks. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 4140–4149 (2017)

  225. 225.

    Li, Y., Yuan, Y.: Convergence analysis of two-layer neural networks with relu activation. In: Advances in Neural Information Processing Systems, pp. 597–607 (2017)

  226. 226.

    Brutzkus, A., Globerson, A., Malach, E., Shalev-Shwartz, S.: Sgd learns over-parameterized networks that provably generalize on linearly separable data. International Conference on Learning Representations (2018)

  227. 227.

    Wang, G., Giannakis, G.B., Chen, J.: Learning relu networks on linearly separable data: algorithm, optimality, and generalization (2018). arXiv:1808.04685

  228. 228.

    Zhang, X., Yu, Y., Wang, L., Gu, Q.: Learning one-hidden-layer relu networks via gradient descent (2018). arXiv:1806.07808

  229. 229.

    Du, S.S., Lee, J.D.: On the power of over-parametrization in neural networks with quadratic activation (2018). arXiv:1803.01206

  230. 230.

    Oymak, S., Soltanolkotabi, M.: Towards moderate overparameterization: global convergence guarantees for training shallow neural networks (2019). arXiv:1902.04674

  231. 231.

    Su, L., Yang, P.: On learning over-parameterized neural networks: a functional approximation prospective. In: Advances in Neural Information Processing Systems, pp. 2637–2646 (2019)

  232. 232.

    Janzamin, M., Sedghi, H., Anandkumar, A.: Beating the perils of non-convexity: guaranteed training of neural networks using tensor methods (2015). arXiv:1506.08473

  233. 233.

    Mondelli, M., Montanari, A.: On the connection between learning two-layers neural networks and tensor decomposition (2018). arXiv:1802.07301

  234. 234.

    Boob, D., Lan, G.: Theoretical properties of the global optimizer of two layer neural network (2017). arXiv:1710.11241

  235. 235.

    Du, S.S., Lee, J.D., Tian, Y., Poczos, B., Singh, A.: Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima (2017). arXiv:1712.00779

  236. 236.

    Vempala, S., Wilmes, J.: Polynomial convergence of gradient descent for training one-hidden-layer neural networks (2018). arXiv:1805.02677

  237. 237.

    Ge, R., Kuditipudi, R., Li, Z., Wang, X.: Learning two-layer neural networks with symmetric inputs (2018). arXiv:1810.06793

  238. 238.

    Oymak, S., Soltanolkotabi, M.: Overparameterized nonlinear learning: Gradient descent takes the shortest path? (2018). arXiv:1812.10004

  239. 239.

    Sun, J.: List of works on “provable nonconvex methods/algorithms”. https://sunju.org/research/nonconvex/

  240. 240.

    Leventhal, D., Lewis, A.S.: Randomized methods for linear constraints: convergence rates and conditioning. Math. Oper. Res. 35(3), 641–654 (2010)


  241. 241.

    Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)


  242. 242.

    Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)

  243. 243.

    Defazio, A., Bach, F., Lacoste-Julien, S.: Saga: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)

  244. 244.

    Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, New York (1999)




We would like to thank Leon Bottou, Yann LeCun, Yann Dauphin, Yuandong Tian, Mark Tygert, Levent Sagun, Lechao Xiao, Tengyu Ma, Jason Lee, Matus Telgarsky, Ju Sun, Wei Hu, Simon Du, Lei Wu, Quanquan Gu, Justin Sirignano, Tian Ding, Dawei Li, Shiyu Liang and R. Srikant for discussions on various results reviewed in this article. We thank Rob Bray for proofreading a part of this article. We thank Ju Sun for the list of related works on the webpage [239], which helped in writing this article. We thank Zai-Wen Wen for the invitation to write this article.

Author information



Corresponding author

Correspondence to Ruo-Yu Sun.

Review of Large-Scale (Convex) Optimization Methods


In this subsection, we review several methods in large-scale optimization that are closely related to deep learning.

Since the neural network optimization problem is often of huge size (at least millions of optimization variables and millions of samples), a method that directly inverts a matrix in each iteration, such as Newton's method, is often considered impractical. Thus, we will focus on first-order methods, i.e., iterative algorithms that mainly use gradient information (though we will briefly discuss quasi-Newton methods).

To unify these methods in one framework, we start with the common convergence rate results of the GD method and explain how different methods improve the convergence rate in different ways. Consider the prototype convergence rate result in convex optimization: the epoch-complexity (Footnote 14) is \(O( \kappa \log 1/\varepsilon )\) or \(O( \beta /\varepsilon )\). These rates mean the following: the number of epochs needed to achieve error \(\varepsilon \) is no more than a constant multiple of \( \kappa \log 1/\varepsilon \) for strongly convex problems (or of \(\beta /\varepsilon \) for convex problems), where \(\kappa \) is the condition number of the problem and \(\beta \) is the maximal Lipschitz constant of all gradients.
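To make the \(O( \kappa \log 1/\varepsilon )\) rate concrete, here is a minimal sketch (not part of the original article) that counts GD iterations on a diagonal strongly convex quadratic. The test problem, the stepsize \(1/\lambda _{\max }\), the starting point and the name `gd_iterations` are all illustrative assumptions; each GD iteration is counted as one epoch.

```python
import numpy as np

def gd_iterations(kappa, eps=1e-6, n=50, max_iter=1_000_000):
    """Count GD iterations needed to shrink f(x) = 0.5 * x^T A x by a factor eps."""
    eigs = np.linspace(1.0, kappa, n)   # A = diag(eigs), so the condition number is kappa
    x = np.ones(n)                      # arbitrary starting point (assumption)
    f0 = 0.5 * np.sum(eigs * x**2)
    step = 1.0 / eigs[-1]               # stepsize 1/beta with beta = lambda_max
    for k in range(1, max_iter + 1):
        x = x - step * (eigs * x)       # the gradient of f is A x
        if 0.5 * np.sum(eigs * x**2) <= eps * f0:
            return k
    return max_iter

iters = {kappa: gd_iterations(kappa) for kappa in (10, 100, 1000)}
for kappa, k in iters.items():
    # Compare the observed count with the bound kappa * log(1/eps).
    print(kappa, k, kappa * np.log(1e6))
```

In this toy setting the iteration count grows with \(\kappa \) and stays below \(\kappa \log 1/\varepsilon \), which is what the rate predicts; the exact counts depend on the assumed starting point and spectrum.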

There are at least four classes of methods that can improve the convergence rate \(O( \kappa \log 1/\varepsilon )\) for strongly convex quadratic problems (Footnote 15).

The first class of methods is based on decomposition, i.e., decomposing a large problem into smaller ones. Typical methods include SGD and coordinate descent (CD). The theoretical benefit is relatively well understood for CD, and somewhat well understood for SGD-type methods. A simple argument for the benefit [3] is the following: if all n training samples are the same, then the gradient for one sample is proportional to the gradient for all samples, so one iteration of SGD gives the same update as one iteration of GD; since one iteration of GD costs n times as much computation as one iteration of SGD, GD is n times slower than SGD. Below we discuss more precise convergence rate results that illustrate the benefit of CD and SGD. For an unconstrained n-dimensional convex quadratic problem with all diagonal entries being 1 (Footnote 16):

  • Randomized CD has an epoch-complexity \(O(\kappa _{\mathrm {CD}} \log 1/\varepsilon )\) [240, 241], where \(\kappa _{\mathrm {CD}} = \lambda _{\text {avg}}/\lambda _{\min }\) is the ratio of the average eigenvalue \(\lambda _{\text {avg}}\) over the minimum eigenvalue \(\lambda _{\min }\) of the coefficient matrix. This is smaller than the GD rate \(O(\kappa \log 1/\varepsilon )\), where \(\kappa = \lambda _{\max }/\lambda _{\min }\), by a factor of \( \lambda _{\max }/\lambda _{\text {avg}} \), where \(\lambda _{\max } \) is the maximum eigenvalue. Clearly, the improvement ratio \( \lambda _{\max }/\lambda _{\text {avg}} \) lies in [1, n], thus randomized CD is 1 to n times faster than GD.

  • For SGD-type methods, very similar acceleration can be proved. Recent variants of SGD (such as SVRG [242] and SAGA [243]) achieve the same convergence rate as randomized CD for the equal-diagonal quadratic problems (though this is not pointed out in the literature) and are also 1 to n times faster than GD. We highlight that this up-to-n-factor acceleration has been the major focus of recent studies of SGD-type methods and has attracted much attention in the theoretical machine learning community.

  • Classical theory of vanilla SGD [3] often uses a diminishing stepsize and thus does not enjoy the same benefit as SVRG and SAGA; however, as mentioned before, constant-stepsize SGD works quite well in many machine learning problems, and in these problems SGD may enjoy the same advantage as SVRG and SAGA.
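The gap between \(\kappa \) and \(\kappa _{\mathrm {CD}}\) can be seen on a small example. The sketch below is an illustration under stated assumptions, not the authors' experiment: the unit-diagonal matrix \(A = (1-\rho ) I + \rho \mathbf {1}\mathbf {1}^T\), the random starting point and the helper names (`make_problem`, `gd_epochs`, `rcd_epochs`) are hypothetical choices. One GD iteration and n randomized coordinate updates are each counted as one epoch.

```python
import numpy as np

def make_problem(n=50, rho=0.5):
    # Unit-diagonal test matrix A = (1 - rho) I + rho * 11^T (hypothetical example).
    return (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))

def f(A, x):
    return 0.5 * x @ A @ x              # minimized at x = 0 with value 0

def gd_epochs(A, x0, eps, max_epochs=10_000):
    lam_max = np.linalg.eigvalsh(A)[-1]
    x, target = x0.copy(), eps * f(A, x0)
    for epoch in range(1, max_epochs + 1):
        x -= (A @ x) / lam_max          # one full gradient step = one epoch
        if f(A, x) <= target:
            return epoch
    return max_epochs

def rcd_epochs(A, x0, eps, rng, max_epochs=10_000):
    n = len(x0)
    x, target = x0.copy(), eps * f(A, x0)
    g = A @ x                           # running gradient, updated incrementally
    for epoch in range(1, max_epochs + 1):
        for i in rng.integers(0, n, size=n):   # n coordinate updates = one epoch
            delta = -g[i] / A[i, i]     # exact minimization along coordinate i
            x[i] += delta
            g += delta * A[:, i]
        if f(A, x) <= target:
            return epoch
    return max_epochs

rng = np.random.default_rng(0)
A = make_problem()
x0 = rng.standard_normal(A.shape[0])
eigs = np.linalg.eigvalsh(A)
kappa = eigs[-1] / eigs[0]              # lambda_max / lambda_min
kappa_cd = eigs.mean() / eigs[0]        # lambda_avg / lambda_min
gd = gd_epochs(A, x0, 1e-6)
cd = rcd_epochs(A, x0, 1e-6, rng)
print(f"kappa={kappa:.1f} kappa_CD={kappa_cd:.1f} GD epochs={gd} RCD epochs={cd}")
```

For this matrix \(\lambda _{\max }/\lambda _{\text {avg}}\) is about n/2, so it sits near the favorable end of the [1, n] range; a matrix with a flat spectrum would show little or no gain.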

The second class of methods is fast gradient methods (FGM), which have convergence rate \(O( \sqrt{ \kappa } \log 1/\varepsilon )\), thus saving a factor of \(\sqrt{\kappa }\) compared to the GD convergence rate \(O( \kappa \log 1/\varepsilon )\). FGM includes the conjugate gradient method, the heavy ball method and the accelerated gradient method. For quadratic problems, these three methods all achieve the improved rate \(O( \sqrt{ \kappa } \log 1/\varepsilon )\). For general strongly convex problems, only the accelerated gradient method is known to achieve the rate \(O( \sqrt{ \kappa } \log 1/\varepsilon )\).
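As a rough illustration of the \(\sqrt{\kappa }\) saving, the sketch below compares GD with the heavy ball method on a diagonal quadratic. It is a sketch under assumptions rather than a reference implementation: the test spectrum, the starting point and the name `iterations_to_target` are illustrative, and the heavy ball parameters are the classical choices for quadratics, stepsize \(4/(\sqrt{L}+\sqrt{\mu })^2\) and momentum \(((\sqrt{L}-\sqrt{\mu })/(\sqrt{L}+\sqrt{\mu }))^2\), where \(\mu \) and \(L\) denote the extreme eigenvalues.

```python
import numpy as np

def iterations_to_target(eigs, step, momentum, eps=1e-6, max_iter=100_000):
    """Run x+ = x - step * A x + momentum * (x - x_prev) on f(x) = 0.5 x^T A x,
    with A = diag(eigs), and count iterations until f drops by a factor eps."""
    x_prev = x = np.ones_like(eigs)     # arbitrary starting point (assumption)
    target = eps * 0.5 * np.sum(eigs * x**2)
    for k in range(1, max_iter + 1):
        x, x_prev = x - step * (eigs * x) + momentum * (x - x_prev), x
        if 0.5 * np.sum(eigs * x**2) <= target:
            return k
    return max_iter

kappa = 1000.0
eigs = np.linspace(1.0, kappa, 50)
mu, L = eigs[0], eigs[-1]

# Plain GD is the special case with zero momentum and stepsize 1/L.
gd_iters = iterations_to_target(eigs, step=1.0 / L, momentum=0.0)

# Heavy ball with the classical parameter choice for quadratics.
hb_step = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
hb_momentum = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
hb_iters = iterations_to_target(eigs, step=hb_step, momentum=hb_momentum)

print("GD:", gd_iters, "heavy ball:", hb_iters)
```

Note that these heavy ball parameters require knowing \(\mu \) and \(L\), which is one reason the comparison is cleanest on quadratics.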

The third class of methods utilizes the second-order information of the problem, including quasi-Newton methods and the Barzilai–Borwein method. Quasi-Newton methods such as BFGS and limited-memory BFGS (see, e.g., [244]) use an approximation of the Hessian in each epoch and are popular choices for many nonlinear optimization problems. The Barzilai–Borwein (BB) method uses a diagonal matrix estimate of the Hessian and can also be viewed as GD with a special stepsize that approximates the curvature of the problem. It seems very difficult to theoretically justify the advantage of these methods over GD, but intuitively, the convergence speed of these methods relies much less on the condition number \(\kappa \) (or any variant of the condition number such as \(\kappa _{\mathrm {CD}}\)). A rigorous time complexity analysis of any method in this class, even for unconstrained quadratic problems, remains largely open.
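The view of BB as "GD with a special stepsize" can be sketched in a few lines. The code below uses one common form of the BB stepsize, \(s^T s / s^T y\) with \(s = x_k - x_{k-1}\) and \(y = g_k - g_{k-1}\); the diagonal test problem, the first stepsize \(1/\lambda _{\max }\) and the function names are illustrative assumptions, and the iteration counts are an empirical observation on this toy problem, not a complexity guarantee.

```python
import numpy as np

def gd_iterations(eigs, eps=1e-6, max_iter=100_000):
    """GD with stepsize 1/lambda_max on f(x) = 0.5 x^T A x, A = diag(eigs)."""
    f = lambda x: 0.5 * np.sum(eigs * x**2)
    x = np.ones_like(eigs)
    target = eps * f(x)
    for k in range(1, max_iter + 1):
        x = x - (eigs * x) / eigs.max()
        if f(x) <= target:
            return k
    return max_iter

def bb_iterations(eigs, eps=1e-6, max_iter=100_000):
    """Same problem, but the stepsize is recomputed from the last two iterates."""
    f = lambda x: 0.5 * np.sum(eigs * x**2)
    x = np.ones_like(eigs)
    target = eps * f(x)
    g = eigs * x
    step = 1.0 / eigs.max()             # first step is a plain GD step (assumption)
    for k in range(1, max_iter + 1):
        x_new = x - step * g
        g_new = eigs * x_new
        if f(x_new) <= target:
            return k
        s, y = x_new - x, g_new - g     # changes in iterate and gradient
        step = (s @ s) / (s @ y)        # BB stepsize: inverse of a curvature estimate
        x, g = x_new, g_new
    return max_iter

eigs = np.linspace(1.0, 1000.0, 50)
gd_iters = gd_iterations(eigs)
bb_iters = bb_iterations(eigs)
print("GD:", gd_iters, "BB:", bb_iters)
```

On a quadratic, \(s^T y / s^T s\) is a Rayleigh quotient of the Hessian, so each BB stepsize lies between \(1/\lambda _{\max }\) and \(1/\lambda _{\min }\); the objective is not guaranteed to decrease at every step.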

The fourth class of methods is parallel computation methods, which can be combined with the aforementioned three classes. As discussed in the classical book [101], GD is a special case of the Jacobi method, which is naturally parallelizable, and CD is a Gauss–Seidel-type method, which may require some extra tricks to parallelize. For example, for minimizing an n-dimensional least squares problem, each epoch of GD mainly requires a matrix-vector product, which is parallelizable. More specifically, while a serial model takes \(O(n^2)\) time steps to perform a matrix-vector product, a parallel model can take as few as \(O(\log n )\) time steps. For CD or SGD, each iteration consists of one or a few vector-vector products, and each vector-vector product is parallelizable. Multiple iterations in one epoch of CD or SGD may not be parallelizable in the worst case (e.g. a dense coefficient matrix), but when the problem exhibits some sparsity (e.g. the coefficient matrix is sparse), they can be partially parallelized. The above discussion seems to show that “batch” methods such as GD might be faster than CD or SGD in a parallel setting; however, it is an over-simplified discussion, and many other factors such as synchronization cost and the communication burden can greatly affect the performance. In general, parallel computation in numerical optimization is quite complicated, which is why the whole book [101] is devoted to this topic.

We briefly summarize the benefits of these methods below. For minimizing n-dimensional quadratic functions (with equal diagonal entries), the benchmark GD takes time \(O( n^2 \kappa \log 1/\varepsilon )\) to achieve error \(\varepsilon \). The first class (e.g. CD and SVRG) improves it to \(O( n^2 \kappa _{\mathrm {CD}} \log 1/\varepsilon )\), the second class (e.g. accelerated gradient method) improves it to \(O( n^2 \sqrt{ \kappa } \log 1/\varepsilon )\), the third class (e.g. BFGS and BB) may improve \(\kappa \) to some other unknown parameter, and the fourth class (parallel computation) can potentially improve it to \(O( \kappa \log n \log 1/\varepsilon )\) with extra cost such as communication. Although we treat these methods as separate classes, researchers have extensively studied various mixed methods of two or more classes, though the theoretical analysis can be much harder. Even just for quadratic problems, the theoretical analysis cannot fully predict the practical behavior of these algorithms or their mixtures, but these results provide quite useful understanding. For general convex and non-convex problems, some parts of the above theoretical analysis still hold, but many questions remain open.


Cite this article

Sun, R. Optimization for Deep Learning: An Overview. J. Oper. Res. Soc. China 8, 249–294 (2020). https://doi.org/10.1007/s40305-020-00309-6



Keywords

  • Deep learning
  • Non-convex optimization
  • Neural networks
  • Convergence
  • Landscape

Mathematics Subject Classification

  • 90C30
  • 68Q32