Applied Intelligence, Volume 49, Issue 7, pp 2522–2545

Sparse semi-autoencoders to solve the vanishing information problem in multi-layered neural networks

  • Ryotaro Kamimura
  • Haruhiko Takeuchi


This paper proposes a new neural network, the “sparse semi-autoencoder”, to overcome the vanishing information problem inherent in multi-layered neural networks. The vanishing information problem is the natural tendency of multi-layered neural networks to lose information in input patterns as well as in training errors; it also covers the reduction in information caused by constraints such as sparse regularization. To overcome this problem, two methods are proposed: input information enhancement by semi-autoencoders, and the separation of error minimization from sparse regularization by soft pruning. First, information in input patterns is enhanced to prevent it from decreasing as it passes through multiple layers. This enhancement is realized by a new architecture called the “semi-autoencoder”, in which information in input patterns is supplied to all hidden layers so that the original information is kept as much as possible. Second, information reduction by sparse regularization is separated from information acquisition by error minimization. Sparse regularization is usually applied in training autoencoders, and it naturally tends to decrease information by restricting the information capacity. This penalty-based information reduction tends to eliminate even necessary and important information, because many parameters are needed to harmonize the penalties with error minimization. Thus, we introduce a new method of soft pruning, in which information acquisition by error minimization and information reduction by sparse regularization are applied separately, without the drastic change in connection weights seen in pruning methods.
The two methods, information enhancement and soft pruning, jointly try to keep the original information as much as possible, and in particular to keep necessary and important information, by allowing a flexible compromise between information acquisition and reduction. The method was applied to an artificial data set, an eye-tracking data set, and a rebel forces participation data set. With the artificial data set, we demonstrated that soft pruning increased the selectivity of connection weights, giving sparse weights, and that the final weights could be interpreted naturally. When the method was applied to the real eye-tracking data set, it outperformed the conventional methods, including ensemble methods, in terms of generalization. In addition, for the eye-tracking data set, the final results could be interpreted according to the conventional eye-tracking theory of the choice process. Finally, the rebel data set showed the possibility of detailed interpretation of the relations between inputs and outputs. However, it was also found that the method had the limitation that the selectivity by soft pruning could not be increased.
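The idea of supplying input information to all hidden layers can be illustrated with a minimal sketch. This is only one possible reading of the abstract, not the authors' implementation: it assumes that the input vector is concatenated with the previous hidden activations at every layer, and the function names (`build_network`, `forward`), the layer sizes, the tanh activation, and the omission of biases are all hypothetical choices made for illustration. Training, sparse regularization, and soft pruning are not shown.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return a weight matrix (n_out rows, n_in columns) of small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def dense(weights, vector):
    """Compute tanh(W v) for one layer; biases are omitted to keep the sketch short."""
    return [math.tanh(sum(w * v for w, v in zip(row, vector))) for row in weights]


def build_network(n_input, hidden_sizes, seed=0):
    """Create layers whose fan-in is (previous hidden size + n_input).

    Assumption: the first layer sees only the input, and every later layer
    sees the previous hidden activations concatenated with the input again.
    """
    rng = random.Random(seed)
    layers = []
    prev = 0
    for size in hidden_sizes:
        layers.append(make_layer(prev + n_input, size, rng))
        prev = size
    return layers


def forward(layers, x):
    """Forward pass that re-injects the input x at every hidden layer."""
    hidden = []
    for weights in layers:
        hidden = dense(weights, hidden + list(x))
    return hidden


# Hypothetical sizes: 4 inputs and three hidden layers of 6, 5, and 3 units.
layers = build_network(n_input=4, hidden_sizes=[6, 5, 3])
output = forward(layers, [0.2, -0.1, 0.7, 0.0])
fan_ins = [len(layer[0]) for layer in layers]
```

In this sketch every layer after the first has a fan-in equal to the previous layer's width plus the input dimension, so each hidden layer has direct access to the original input pattern rather than only to a transformed version of it.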


Multi-layered neural networks, Autoencoder, Semi-autoencoder, Sparsity, Soft pruning, Information augmentation, Generalization, Interpretation, Vanishing information



We are very grateful to the editor and two reviewers for their valuable comments on the paper. This research was supported by the Japan Society for the Promotion of Science under Grant-in-Aid for Scientific Research 16K00339.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. IT Education Center, Tokai University, Hiratsuka, Japan
  2. Human Informatics Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Japan
