SAOSA: Stable Adaptive Optimization for Stacked Auto-encoders


Stacked auto-encoders are deep learning models that automatically extract meaningful unsupervised features from input data through a hierarchical learning process. The parameters of each auto-encoder (AE) are learnt layer by layer. Because optimization is a core component of neural networks and AEs, the learning rate is one of their most crucial hyper-parameters, and its choice matters even more on large-scale and, especially, sparse data sets. In this paper, we adapt the learning rate of the AE separately for the various components of the network at each stochastic gradient calculation, and we analyze the theoretical convergence of back-propagation learning for the proposed method. We also extend the methodology to online adaptive optimization methods suitable for deep learning. Using a single machine, we obtain promising results compared with constant learning rates on four classification tasks: (1) MNIST digits, (2) blogs-Gender-100 text, (3) smartphone-based recognition of human activities and postural transitions (time series), and (4) EEG brainwave feeling emotions (time series).


Multi-layer perceptron


Stochastic gradient descent


Stable gradient descent


Natural gradient descent


Kalman-based stochastic gradient descent


Root mean square


Resilient back-propagation rule


Kullback-Leibler divergence


Nesterov accelerated gradient


Adaptive sub-gradient method for online learning and stochastic optimization


Adaptive learning rate method


Root mean square propagation method


Adaptive moment estimation method


Nesterov-accelerated adaptive moment estimation method


Exponential moving average method


Stable adaptive network-based fuzzy inference system


Particle swarm optimization




Stacked auto-encoder


Long short-term memory


Convolutional neural network


Author information



Corresponding author

Correspondence to Mohammad Teshnehlab.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


A Appendix: a solution for \(c^o\)

According to (21) and (25):

$$\begin{aligned} \mathbf{w} _j^o(k+1)= \mathbf{w} _j^o(k) +\frac{e_j(k)g_j'(k)\mathbf{h} (k)}{(g_j'(k))^2\Vert \mathbf{h} (k)\Vert ^2+c^o(k)} \quad \text{for every } j \end{aligned}$$
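As an illustration only, the output-layer update above can be sketched in a few lines of plain Python. The function and argument names are hypothetical, and the learning rate is read as \(\eta _j^o(k)=1/((g_j'(k))^2\Vert \mathbf{h} (k)\Vert ^2+c^o(k))\), which is the form implied by the derivative taken later in this appendix; \(c^o(k)\) is assumed to be positive so that the denominator cannot vanish.

```python
def output_weight_step(w_j, h, e_j, g_prime_j, c_o):
    """One stabilized update of the output weight vector w_j^o (sketch).

    w_j       : current weights of output unit j (list of floats)
    h         : hidden activation vector h(k)
    e_j       : error of output unit j, e_j(k)
    g_prime_j : derivative of the output activation, g_j'(k)
    c_o       : stabilizing term c^o(k), assumed positive in this sketch
    """
    h_norm_sq = sum(v * v for v in h)
    # Adaptive learning rate: 1 / ((g_j')^2 * ||h||^2 + c^o)
    eta = 1.0 / (g_prime_j ** 2 * h_norm_sq + c_o)
    # w_j(k+1) = w_j(k) + eta * e_j * g_j' * h
    new_w = [w + eta * e_j * g_prime_j * v for w, v in zip(w_j, h)]
    return new_w, eta


# Toy values chosen only for illustration (linear output, so g' = 1).
new_w, eta = output_weight_step([0.0, 0.0], [1.0, 2.0], 0.5, 1.0, 1.0)
```

Because the denominator grows with \(\Vert \mathbf{h} (k)\Vert ^2\), the step shrinks automatically for large hidden activations, and \(c^o(k)\) sets an upper bound of \(1/c^o(k)\) on the learning rate.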

A value of \(c^o (k)\) is needed at each iteration k. Therefore, the adaptive algorithm based on gradient descent is:

$$\begin{aligned} \mathbf{c} ^o (k)= \mathbf{c} ^o (k-1)+\rho ^o \frac{\partial J(k)}{\partial \mathbf{c} ^o (k-1)} \end{aligned}$$

This shows that we need the gradient of the cost function J(k) with respect to \(c^o (k-1)\). According to the chain rule, we have:

$$\begin{aligned} \begin{aligned}&\frac{\partial J(k)}{\partial \mathbf{c} ^o(k-1)} \\&\quad =\frac{\partial J(k)}{\partial \mathbf{e} (k)} \frac{\partial \mathbf{e} (k)}{\partial \mathbf{r} (k)} \frac{\partial \mathbf{r} (k)}{\partial \mathbf{net}2 (k)} \frac{\partial \mathbf{net}2 (k)}{\partial W^o(k)} \frac{\partial W^o(k)}{\partial \varvec{\eta }^o(k-1)} \frac{\partial \varvec{\eta }^o(k-1)}{\partial \mathbf{c} ^o(k-1)} \\&\quad =-\mathbf{e} (k)\mathbf{g} '(k)\mathbf{h} (k)\frac{\partial W^o(k)}{\partial \varvec{\eta }^o(k-1)} \frac{\partial \varvec{\eta }^o(k-1)}{\partial \mathbf{c} ^o(k-1)} \end{aligned} \end{aligned}$$

Replacing k with k-1 in (25) and differentiating with respect to \(c_j^o(k-1)\) gives

$$\begin{aligned} \frac{\partial \eta _{j}^o(k-1)}{\partial c_{j}^o(k-1)} =\frac{-1}{((g_j'(k-1))^2\Vert \mathbf{h} (k-1)\Vert ^2+c_j^o(k-1))^2} \end{aligned}$$

and, applying the same shift from k to k-1 in (20),

$$\begin{aligned} \frac{\partial W^o(k)}{\partial \eta _{j}^o(k-1)} =e_j(k-1)g_j'(k-1)\mathbf{h} (k-1) \end{aligned}$$

Finally, according to (51), (52) and (50) we have:

$$\begin{aligned} \begin{aligned} \frac{\partial J(k)}{\partial c_j^o(k-1)}= \frac{e_j(k)g_j'(k)\mathbf{h} (k) \times e_j(k-1)g_j'(k-1)\mathbf{h} (k-1)}{((g_j'(k-1))^2\Vert \mathbf{h} (k-1)\Vert ^2+c_j^o(k-1))^2} \end{aligned} \end{aligned}$$
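The final expression and the update rule for \(c^o\) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names are hypothetical, the product \(\mathbf{h} (k)\times \mathbf{h} (k-1)\) is assumed to be an inner product so that the gradient is a scalar, and the sign of the update is kept exactly as printed in this appendix.

```python
def c_o_gradient(e_now, g_now, h_now, e_prev, g_prev, h_prev, c_prev):
    """Gradient of J(k) with respect to c_j^o(k-1), per the last equation (sketch)."""
    # Assumption: h(k) x h(k-1) is interpreted as an inner product.
    overlap = sum(a * b for a, b in zip(h_now, h_prev))
    # Squared denominator of the learning rate at iteration k-1.
    denom = (g_prev ** 2 * sum(v * v for v in h_prev) + c_prev) ** 2
    return e_now * g_now * e_prev * g_prev * overlap / denom


def update_c_o(c_prev, grad, rho_o):
    """c^o(k) = c^o(k-1) + rho^o * dJ(k)/dc^o(k-1), sign as in the text."""
    return c_prev + rho_o * grad


# Toy values chosen only for illustration.
grad = c_o_gradient(0.5, 1.0, [1.0, 0.0], 1.0, 1.0, [1.0, 1.0], 2.0)
c_new = update_c_o(2.0, grad, rho_o=0.1)
```

Only quantities from iterations k and k-1 are required, so the adaptation of \(c^o\) adds a constant amount of work per output unit on top of the usual back-propagation step.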

B Appendix: a solution for \(c^h\)

According to (30) and (32):

$$\begin{aligned} \mathbf{w} _i^h(k+1)= \mathbf{w} _i^h(k) +\frac{e_{j}(k)g_{j}'(k)\mathbf{w} _j^o(k)\mathbf{f} '(k)x_i(k)}{(g_j'(k))^2\Vert \mathbf{f} '(k)\Vert ^2 \Vert \mathbf{w} _j^o(k)\Vert ^2\Vert \mathbf{x} (k)\Vert ^2+c^h(k)} \end{aligned}$$
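A comparable sketch for the hidden-layer update is given below. It rests on several assumptions that are not stated in this appendix: a single output unit j, \(\mathbf{w} _i^h\) read as the weights from input i to every hidden unit, and \(\mathbf{w} _j^o(k)\mathbf{f} '(k)\) read as an element-wise product over hidden units. All names are hypothetical.

```python
def norm_sq(v):
    """Squared Euclidean norm of a list of floats."""
    return sum(a * a for a in v)


def hidden_weight_step(w_i, x, i, w_o_j, f_prime, e_j, g_prime_j, c_h):
    """One stabilized update of the hidden weights fed by input i (sketch).

    w_i       : weights from input i to each hidden unit
    x         : input vector x(k); x[i] is the component used in the numerator
    w_o_j     : output weights of unit j, one entry per hidden unit
    f_prime   : hidden activation derivatives f'(k), one entry per hidden unit
    e_j       : output error e_j(k)
    g_prime_j : output activation derivative g_j'(k)
    c_h       : stabilizing term c^h(k), assumed positive in this sketch
    """
    denom = (g_prime_j ** 2 * norm_sq(f_prime) * norm_sq(w_o_j) * norm_sq(x)
             + c_h)
    eta = 1.0 / denom
    # Assumption: w_j^o * f' is taken element-wise over hidden units.
    new_w = [w + eta * e_j * g_prime_j * wo * fp * x[i]
             for w, wo, fp in zip(w_i, w_o_j, f_prime)]
    return new_w, eta


# Toy values chosen only for illustration.
new_w_h, eta_h = hidden_weight_step(
    [0.0, 0.0], [1.0, 2.0], 0, [1.0, -1.0], [0.5, 0.5], 1.0, 1.0, 1.0)
```

The denominator multiplies the squared norms of every factor in the back-propagated signal, so the hidden-layer step is scaled down whenever any of them becomes large.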

A value of \(c^h (k)\) is needed at each iteration k. Therefore, the adaptive algorithm based on gradient descent is:

$$\begin{aligned} \mathbf{c} ^h (k)= \mathbf{c} ^h (k-1)+\rho ^h \frac{\partial J(k)}{\partial \mathbf{c} ^h (k-1)} \end{aligned}$$

This shows that we need the gradient of the cost function J(k) with respect to \(c^h (k-1)\). According to the chain rule, we have:

$$\begin{aligned} \begin{aligned}&\frac{\partial J(k)}{\partial \mathbf{c} ^h(k-1)} \\&\quad =\frac{\partial J(k)}{\partial \mathbf{e} (k)} \frac{\partial \mathbf{e} (k)}{\partial \mathbf{r} (k)} \frac{\partial \mathbf{r} (k)}{\partial \mathbf{net}2 (k)} \frac{\partial \mathbf{net}2 (k)}{\partial \mathbf{h} (k)} \frac{\partial \mathbf{h} (k)}{\partial \mathbf{net}1 (k)} \\&\qquad \times \frac{\partial \mathbf{net}1 (k)}{\partial W^h(k)} \frac{\partial W^h(k)}{\partial \varvec{\eta }^h(k-1)} \frac{\partial \varvec{\eta }^h(k-1)}{\partial \mathbf{c} ^h(k-1)} \\&\quad =-\mathbf{e} (k)\mathbf{g} '(k)W^o(k)\mathbf{f} '(k)\mathbf{x} (k)\frac{\partial W^h(k)}{\partial \varvec{\eta }^h(k-1)} \frac{\partial \varvec{\eta }^h(k-1)}{\partial \mathbf{c} ^h(k-1)} \end{aligned} \end{aligned}$$

Replacing k with k-1 in (32) and differentiating with respect to \(c_j^h(k-1)\) gives

$$\begin{aligned} \begin{aligned}&\frac{\partial \eta _{j}^h(k-1)}{\partial c_{j}^h(k-1)} \\&\quad =\frac{-1}{((g_j'(k-1))^2\Vert \mathbf{f} '(k-1)\Vert ^2 \Vert \mathbf{w} _j^o(k-1)\Vert ^2\Vert \mathbf{x} (k-1)\Vert ^2+c_j^h(k-1))^2} \end{aligned} \end{aligned}$$

and, applying the same shift from k to k-1 in (27),

$$\begin{aligned} \frac{\partial W^h(k)}{\partial \eta _{j}^h(k-1)} = e_j(k-1)g_j'(k-1)\mathbf{w} _j^o(k-1)\mathbf{f} '(k-1)x_i(k-1) \end{aligned}$$

Finally, according to (57), (58) and (56) we have:

$$\begin{aligned} \begin{aligned}&\frac{\partial J(k)}{\partial c_j^h(k-1)}= e_j(k)g_j'(k)\mathbf{w} _j^o(k)\mathbf{f} '(k)x_i(k) \\&\quad \times \frac{ e_j(k-1)g_j'(k-1)\mathbf{w} _j^o(k-1)\mathbf{f} '(k-1)x_i(k-1)}{((g_j'(k-1))^2\Vert \mathbf{f} '(k-1)\Vert ^2 \Vert \mathbf{w} _j^o(k-1)\Vert ^2\Vert \mathbf{x} (k-1)\Vert ^2+c_j^h(k-1))^2} \end{aligned} \end{aligned}$$
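For completeness, the hidden-layer counterpart can be sketched in the same way. As before this is only an illustration under stated assumptions: the two back-propagated vectors (at k and k-1) are combined with an inner product, \(\mathbf{w} _j^o\mathbf{f} '\) is taken element-wise, the update sign follows the rule printed in this appendix, and every name is hypothetical.

```python
def norm_sq(v):
    """Squared Euclidean norm of a list of floats."""
    return sum(a * a for a in v)


def backprop_vector(e_j, g_prime_j, w_o_j, f_prime, x_i):
    """e_j * g_j' * (w_j^o element-wise f') * x_i, one entry per hidden unit."""
    return [e_j * g_prime_j * wo * fp * x_i for wo, fp in zip(w_o_j, f_prime)]


def c_h_gradient(v_now, v_prev, g_prev, f_prev, w_o_prev, x_prev, c_prev):
    """Gradient of J(k) with respect to c_j^h(k-1) (sketch).

    v_now, v_prev : back-propagated vectors at iterations k and k-1
    The remaining arguments rebuild the squared denominator at k-1.
    """
    # Assumption: the product of the two vectors is an inner product.
    overlap = sum(a * b for a, b in zip(v_now, v_prev))
    denom = (g_prev ** 2 * norm_sq(f_prev) * norm_sq(w_o_prev)
             * norm_sq(x_prev) + c_prev) ** 2
    return overlap / denom


# Toy values chosen only for illustration; the same state is reused for k and k-1.
v = backprop_vector(1.0, 1.0, [1.0, -1.0], [0.5, 0.5], 1.0)
grad_h = c_h_gradient(v, v, 1.0, [0.5, 0.5], [1.0, -1.0], [1.0, 2.0], 1.0)
rho_h = 0.1  # hypothetical adaptation rate rho^h
c_h_new = 1.0 + rho_h * grad_h  # c^h(k) = c^h(k-1) + rho^h * gradient
```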

About this article

Cite this article

Moradi Vartouni, A., Teshnehlab, M. & Sedighian Kashi, S. SAOSA: Stable Adaptive Optimization for Stacked Auto-encoders. Neural Process Lett (2020).

  • Deep neural network
  • Stacked auto-encoder
  • Cost function optimization
  • Stable adaptive learning