Generalization Error in Deep Learning

  • Daniel Jakubovitz
  • Raja Giryes
  • Miguel R. D. Rodrigues
Part of the Applied and Numerical Harmonic Analysis book series (ANHA)


Deep learning models have recently achieved strong performance in fields such as computer vision, speech recognition, speech translation, and natural language processing. Yet despite this state-of-the-art performance, the source of their generalization ability remains generally unclear. An important question is therefore what enables deep neural networks to generalize well from the training set to new data. In this chapter, we provide an overview of the existing theory and bounds that characterize the generalization error of deep neural networks, combining classical and more recent theoretical and empirical results.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Daniel Jakubovitz (1)
  • Raja Giryes (1)
  • Miguel R. D. Rodrigues (2, email author)
  1. School of Electrical Engineering, Tel Aviv University, Tel Aviv-Yafo, Israel
  2. Department of Electronics and Electrical Engineering, University College London, London, UK
