Training Neural Networks Using Predictor-Corrector Gradient Descent

Nesky, Amy; Stout, Quentin F.

doi:10.1007/978-3-030-01424-7_7

Conference paper
First Online: 27 September 2018

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11141)

Abstract

We improve the training time of deep feedforward neural networks using a modified version of gradient descent that we call Predictor-Corrector Gradient Descent (PCGD). PCGD enhances gradient descent with techniques inspired by predictor-corrector methods. It uses a sparse history of network parameter values to make periodic predictions of future parameter values in an effort to skip unnecessary training iterations. Compared to stochastic gradient descent (SGD), this method can cut the number of training epochs needed for a network to reach a particular testing accuracy by nearly one half. PCGD can also outperform, with some trade-offs, Nesterov’s Accelerated Gradient (NAG).
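
The abstract does not give the PCGD update rule, so the following is only a toy sketch of the general idea of predicting future parameters from a sparse history. It assumes a simple linear extrapolation from one stored snapshot and a loss-based acceptance guard; neither is taken from the paper. The names `train`, `history_gap`, and `lookahead`, the quadratic test loss, and all constants are hypothetical choices made for illustration.

```python
# Toy sketch only: linear extrapolation from a sparse parameter history.
# This is NOT the PCGD update rule from the paper; every name and constant
# below is a hypothetical choice made for illustration.

def loss(w):
    # Ill-conditioned quadratic bowl with its minimum at the origin.
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def grad(w):
    # Exact gradient of the quadratic above.
    return [w[0], 25.0 * w[1]]

def train(w0, lr=0.03, steps=60, history_gap=5, lookahead=None):
    """Plain gradient descent; if `lookahead` is set, extrapolate every
    `history_gap` steps from the last stored snapshot."""
    w = list(w0)
    snapshot = list(w0)  # sparse history: a single stored copy of the parameters
    for t in range(1, steps + 1):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]  # ordinary descent step
        if t % history_gap == 0:
            if lookahead is not None:
                # Predictor (assumed form): continue along the direction
                # travelled since the snapshot was taken.
                guess = [wi + lookahead * (wi - si) for wi, si in zip(w, snapshot)]
                # Guard (assumed): keep the prediction only if it lowers the
                # loss; the descent steps that follow then refine it.
                if loss(guess) < loss(w):
                    w = guess
            snapshot = list(w)
    return w

start = [4.0, 1.0]
plain_loss = loss(train(start))
predicted_loss = loss(train(start, lookahead=2.0))
print(f"plain gradient descent loss: {plain_loss:.3e}")
print(f"with periodic prediction:    {predicted_loss:.3e}")
```

On this toy problem the extrapolated run reaches a lower loss in the same number of gradient steps, which mirrors the abstract's motivation of skipping iterations. It says nothing about the behaviour of PCGD itself on neural networks.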

Notes

1. One caution ought to be mentioned here: brain predictions also enable prejudices, so one must be careful how much trust is placed in predictions.
2. Note that the Jacobian, J, is not specific to the column of \(A_{t+1}\).

Acknowledgments

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1256260. This work used the Extreme Science and Engineering Discovery Environment, which is supported by National Science Foundation grant number OCI-1053575. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center.

Author information

Authors and Affiliations

Computer Science and Engineering, University of Michigan, Ann Arbor, MI, 48109, USA
Amy Nesky & Quentin F. Stout

Authors

Amy Nesky
Quentin F. Stout

Corresponding author

Correspondence to Amy Nesky.

Editor information

Editors and Affiliations

Czech Academy of Sciences, Prague 8, Czech Republic
Věra Kůrková
Open University of Cyprus, Latsia, Cyprus
Yannis Manolopoulos
CITEC Bielefeld University, Bielefeld, Germany
Barbara Hammer
Democritus University of Thrace, Xanthi, Greece
Lazaros Iliadis
University of Piraeus, Piraeus, Greece
Ilias Maglogiannis

About this paper

Cite this paper

Nesky, A., Stout, Q.F. (2018). Training Neural Networks Using Predictor-Corrector Gradient Descent. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds) Artificial Neural Networks and Machine Learning – ICANN 2018. ICANN 2018. Lecture Notes in Computer Science, vol 11141. Springer, Cham. https://doi.org/10.1007/978-3-030-01424-7_7

DOI: https://doi.org/10.1007/978-3-030-01424-7_7
Published: 27 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01423-0
Online ISBN: 978-3-030-01424-7
eBook Packages: Computer Science, Computer Science (R0)
