Training Deep Neural Networks



The procedure for training neural networks with backpropagation was briefly introduced in Chapter 1. This chapter expands on that description in several ways.
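To make the backpropagation procedure concrete before the detailed discussion, the following is a minimal sketch (not taken from the text) of training a one-hidden-layer network with manually derived gradients and plain gradient descent; the toy data, layer sizes, and learning rate are illustrative choices.

```python
# Minimal backpropagation sketch: a 1 -> 16 -> 1 network with tanh
# hidden units, trained by gradient descent on a toy regression task.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = sin(x) on a small grid.
X = np.linspace(-3, 3, 64).reshape(-1, 1)
y = np.sin(X)

# Parameters, initialized with small random values.
W1 = rng.normal(0, 0.5, (1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 1)); b2 = np.zeros(1)

lr = 0.05
losses = []
for _ in range(2000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)          # hidden activations
    pred = h @ W2 + b2                # linear output layer
    loss = np.mean((pred - y) ** 2)   # mean squared error
    losses.append(loss)

    # Backward pass: apply the chain rule layer by layer.
    g_pred = 2 * (pred - y) / len(X)  # dL/dpred
    g_W2 = h.T @ g_pred; g_b2 = g_pred.sum(0)
    g_h = g_pred @ W2.T
    g_z = g_h * (1 - h ** 2)          # derivative of tanh
    g_W1 = X.T @ g_z; g_b1 = g_z.sum(0)

    # Gradient-descent parameter update.
    W1 -= lr * g_W1; b1 -= lr * g_b1
    W2 -= lr * g_W2; b2 -= lr * g_b2

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The remainder of the chapter addresses the many refinements to this basic loop, such as better initializations, adaptive learning rates, momentum, and second-order methods.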



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. IBM T. J. Watson Research Center, International Business Machines, Yorktown Heights, USA
