Neural Computing and Applications, Volume 29, Issue 9, pp 401–412

Training neural networks by marginalizing out hidden layer noise

  • Yanjun Li
  • Ping Guo


The generalization ability of neural networks is influenced by the size of the training set. The training process for single-hidden-layer feedforward neural networks (SLFNs) consists of two stages: nonlinear feature mapping and predictor optimization in the hidden layer space. In this paper, we propose a new approach, called marginalizing out hidden layer noise (MHLN), in which the predictor of SLFNs is trained with infinitely many samples. First, MHLN augments the training set in the hidden layer space with constrained samples, which are generated by corrupting the hidden layer outputs of the training set with noise from a given distribution. For any given training sample, as the number of corruptions approaches infinity, the weak law of large numbers allows the explicitly generated constrained samples to be replaced with their expectations. In this way, the training set is implicitly extended in the hidden layer space by an infinite number of constrained samples. Then, MHLN constructs the predictor of SLFNs by optimizing the expected value of a quadratic loss function under the given noise distribution. The results of experiments on twenty benchmark datasets show that MHLN achieves better generalization ability.
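The idea of replacing explicit corruptions with their expectation can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes additive isotropic Gaussian noise with standard deviation `sigma`, a sigmoid hidden layer with fixed random weights, and synthetic data. The function names `hidden_outputs`, `fit_marginalized`, and `fit_explicit` are hypothetical. Under these assumptions the expected quadratic loss has a closed-form minimizer, which the sketch places next to a predictor fitted on explicitly generated corrupted samples.

```python
import numpy as np


def hidden_outputs(X, W_in, b):
    """Nonlinear feature mapping: sigmoid hidden layer with fixed weights."""
    return 1.0 / (1.0 + np.exp(-(X @ W_in + b)))


def fit_marginalized(H, Y, sigma):
    """Minimize the expected quadratic loss under assumed Gaussian noise.

    For h_tilde = h + eps with eps ~ N(0, sigma^2 I):
        E[h_tilde] = h
        E[h_tilde h_tilde^T] = h h^T + sigma^2 I
    Summing over the n training rows, the minimizer solves
        (H^T H + n sigma^2 I) beta = H^T Y.
    """
    n, d = H.shape
    A = H.T @ H + n * sigma**2 * np.eye(d)
    return np.linalg.solve(A, H.T @ Y)


def fit_explicit(H, Y, sigma, copies, rng):
    """Baseline: generate `copies` corrupted samples per row, then fit
    ordinary least squares on the enlarged training set."""
    H_rep = np.repeat(H, copies, axis=0)
    Y_rep = np.repeat(Y, copies, axis=0)
    H_noisy = H_rep + sigma * rng.standard_normal(H_rep.shape)
    return np.linalg.lstsq(H_noisy, Y_rep, rcond=None)[0]


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
Y = np.sin(X[:, :1])

# Fixed random hidden layer with 8 units (an assumption of this sketch).
W_in = rng.standard_normal((3, 8))
b = rng.standard_normal(8)
H = hidden_outputs(X, W_in, b)

beta_marg = fit_marginalized(H, Y, sigma=0.1)
beta_mc = fit_explicit(H, Y, sigma=0.1, copies=200, rng=rng)

# By the law of large numbers, beta_mc should approach beta_marg
# as `copies` grows.
print(np.max(np.abs(beta_marg - beta_mc)))
```

With Gaussian noise the marginalized objective reduces to a ridge-style penalty on the output weights, so the infinite training set never has to be materialized. Other noise distributions would give different expectations and therefore different closed forms.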


Single-hidden-layer feedforward neural networks; Generalization; Constrained samples; Expectation; Quadratic loss function



This work is mainly supported by the National Natural Science Foundation of China (No. 61375045) and the Beijing Natural Science Foundation (No. 4142030).

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.



Copyright information

© The Natural Computing Applications Forum 2017

Authors and Affiliations

  1. Beijing Institute of Technology, Beijing, China
  2. Beijing Normal University, Beijing, China
