Generalization Error in Deep Learning

Chapter in Compressed Sensing and Its Applications

Abstract

Deep learning models have recently shown impressive performance in a variety of fields such as computer vision, speech recognition, speech translation, and natural language processing. However, despite this state-of-the-art performance, the source of their generalization ability is still generally unclear. An important question is therefore what enables deep neural networks to generalize well from the training set to new data. In this chapter, we provide an overview of the existing theory and bounds for characterizing the generalization error of deep neural networks, combining both classical and more recent theoretical and empirical results.
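For concreteness, the quantity at the center of the chapter can be written in a standard form. In the notation below, the hypothesis \(h\), loss \(\ell\), data distribution \(\mathcal{D}\), and \(m\)-sample training set \(S\) are illustrative placeholders rather than the chapter's own notation: the generalization error is the gap between the expected risk and the empirical risk,

\[
\mathrm{GE}(h) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell\big(h(x),y\big)\big] \;-\; \frac{1}{m}\sum_{i=1}^{m}\ell\big(h(x_i),y_i\big), \qquad S=\{(x_i,y_i)\}_{i=1}^{m}.
\]

Bounding this gap in terms of the sample size \(m\) and properties of the trained network is the goal of the theory and bounds surveyed in the chapter.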

Notes

  1. \(\tilde{\mathcal{O}}\) denotes an upper bound on the complexity up to a logarithmic factor in the same term (see the illustration below).
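To make this convention concrete, here is a standard reading of the soft-O notation, stated for illustration rather than quoted from the chapter: \(f \in \tilde{\mathcal{O}}(g)\) means that \(f\) is bounded by \(g\) up to polylogarithmic factors in \(g\),

\[
f(n) \in \tilde{\mathcal{O}}\big(g(n)\big) \quad\Longleftrightarrow\quad f(n) \in \mathcal{O}\big(g(n)\,\log^{k} g(n)\big) \ \text{for some constant } k \ge 0 .
\]

For instance, \(n \log n \in \tilde{\mathcal{O}}(n)\), since it is within a single logarithmic factor of \(n\).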

Author information

Correspondence to Miguel R. D. Rodrigues.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Jakubovitz, D., Giryes, R., Rodrigues, M.R.D. (2019). Generalization Error in Deep Learning. In: Boche, H., Caire, G., Calderbank, R., Kutyniok, G., Mathar, R., Petersen, P. (eds) Compressed Sensing and Its Applications. Applied and Numerical Harmonic Analysis. Birkhäuser, Cham. https://doi.org/10.1007/978-3-319-73074-5_5
