Teaching Deep Learners to Generalize



Neural networks are powerful learners that have repeatedly proven capable of learning complex functions in many domains. However, their great power is also their greatest weakness: unless the learning process is designed carefully, neural networks easily overfit the training data, performing well on the training examples while generalizing poorly to unseen test instances.
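To make the overfitting problem and its remedies concrete, the following is a minimal sketch (assuming the PyTorch library; the network shape and all hyperparameter values are illustrative assumptions, not prescriptions from this chapter) of two standard regularization techniques of the kind treated in this chapter: weight decay, which imposes an L2 (Tikhonov) penalty on the weights, and dropout, which randomly silences hidden units during training.

```python
import torch
import torch.nn as nn

# A small feed-forward classifier with dropout between the layers.
# The layer widths (784 -> 256 -> 10) and dropout rate of 0.5 are
# illustrative choices, not values recommended by the chapter.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes hidden units during training
    nn.Linear(256, 10),
)

# weight_decay adds an L2 penalty on the weights (Tikhonov regularization),
# shrinking them toward zero and reducing the variance of the learner.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y):
    model.train()             # training mode: dropout is active
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Before evaluating on unseen test instances, disable dropout so the
# full network is used:
# model.eval()
```

The practical point of the sketch is that both remedies constrain the effective capacity of the network rather than its nominal size: the penalty discourages large weights, and dropout prevents hidden units from co-adapting, which is why it must be switched off at test time.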


Keywords: Variational autoencoder · Generative adversarial networks · Unseen test instances · True loss function · Contractive autoencoder



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Charu C. Aggarwal, IBM T. J. Watson Research Center, International Business Machines, Yorktown Heights, USA
