Abstract
A quantitative and practical Bayesian framework is described for learning mappings in feedforward networks. The framework makes possible: (1) objective comparisons between solutions using alternative network architectures; (2) objective stopping rules for network pruning or growing procedures; (3) objective choice of the magnitude and type of weight decay terms or additive regularisers (for penalising large weights, etc.); (4) a measure of the effective number of well-determined parameters in a model; (5) quantified estimates of the error bars on network parameters and on network output; (6) objective comparisons with alternative learning and interpolation models such as splines and radial basis functions. The Bayesian ‘evidence’ automatically embodies ‘Occam’s razor’, penalising over-flexible and over-complex models. The Bayesian approach also helps detect poor underlying assumptions in learning models. For learning models well matched to a problem, a good correlation between generalisation ability and the Bayesian evidence is obtained.
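The quantities listed above take simple closed forms once the posterior over the weights is approximated by a Gaussian centred on the most probable weights. The following minimal sketch is written in that spirit, using a linear-in-parameters radial basis function model so that the Hessian A is exact rather than approximated; the toy data, the basis centres and width, and the fixed precisions alpha (weight decay) and beta (noise) are illustrative assumptions, not values from the chapter. It shows the effective number of well-determined parameters (item 4), output error bars (item 5), and the log evidence with its ‘Occam factor’ penalty (items 1–3 and 6).

```python
import numpy as np

# Sketch of the Gaussian-approximation 'evidence' quantities, assuming
# a linear-in-parameters RBF regression model. Notation (alpha, beta,
# gamma, A) follows MacKay's usual conventions; all data are synthetic.

rng = np.random.default_rng(0)
N, k = 30, 8                                        # data points, weights
x = np.linspace(-1.0, 1.0, N)
t = np.sin(3 * x) + 0.1 * rng.standard_normal(N)    # noisy targets

centres = np.linspace(-1.0, 1.0, k)                 # assumed basis centres
Phi = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / 0.3) ** 2)

alpha, beta = 1e-2, 100.0     # weight-decay and noise precisions (fixed here)

# Most probable weights minimise M(w) = alpha*E_W + beta*E_D, with
# E_W = |w|^2 / 2 and E_D = |t - Phi w|^2 / 2.
A = beta * Phi.T @ Phi + alpha * np.eye(k)          # Hessian of M at w_MP
w_mp = beta * np.linalg.solve(A, Phi.T @ t)

# (4) Effective number of well-determined parameters:
#     gamma = sum_i lambda_i / (lambda_i + alpha),
#     lambda_i = eigenvalues of the data term beta * Phi^T Phi.
lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)
gamma = np.sum(lam / (lam + alpha))

# (5) Error bars on the output at a new input x*:
#     var(y*) = 1/beta + g^T A^{-1} g, where g = dy/dw (here, the basis vector).
x_star = 0.5
g = np.exp(-0.5 * ((x_star - centres) / 0.3) ** 2)
y_star = g @ w_mp
var_star = 1.0 / beta + g @ np.linalg.solve(A, g)

# (1)-(3), (6) The log evidence scores the whole model (architecture,
# regulariser, alpha, beta); the -1/2 log det A term is the Occam factor
# that penalises over-flexible, over-complex models.
E_D = 0.5 * np.sum((t - Phi @ w_mp) ** 2)
E_W = 0.5 * w_mp @ w_mp
log_evidence = (-alpha * E_W - beta * E_D
                - 0.5 * np.linalg.slogdet(A)[1]
                + 0.5 * k * np.log(alpha) + 0.5 * N * np.log(beta)
                - 0.5 * N * np.log(2 * np.pi))

print(f"gamma = {gamma:.2f} of {k} parameters well determined")
print(f"y({x_star}) = {y_star:.3f} +/- {np.sqrt(var_star):.3f}")
print(f"log evidence = {log_evidence:.1f}")
```

Comparing `log_evidence` across alternative architectures or regularisers (here, across choices of k, the basis width, or alpha) is the objective model comparison the abstract describes; in the full framework alpha and beta are themselves optimised by maximising the evidence rather than fixed by hand as in this sketch.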
Cite this chapter
MacKay, D.J.C. (1992). The Evidence for Neural Networks. In: Smith, C.R., Erickson, G.J., Neudorfer, P.O. (eds) Maximum Entropy and Bayesian Methods. Fundamental Theories of Physics, vol 50. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2219-3_12