# Automatic Relevance Determination (ARD)

## Abstract

A scheme for the systematic adaptation of the random-parameter distribution widths is introduced. Weights exiting the same input node are combined into a weight group, and the distribution widths of the weight groups are adjusted during training by a method similar to Manhattan updating. A practical algorithm is derived, and an empirical demonstration shows that irrelevant inputs are detected and effectively switched off. The whole scheme was inspired by and is akin to Neal’s and MacKay’s automatic relevance determination. It will therefore be referred to by the same name.

## Reference

- The term hyperparameter will be borrowed, since the *σ*_{rand}^{g} are parameters of a *prior* probability distribution and are closely related to the hyperparameters *α*_{g} in MacKay's work [39].
- By defining *σ* as the exponential of *ρ*, its positivity is always ensured. The other reason for introducing *ρ* is that *σ* is a *scale* parameter. Since a non-informative prior for a scale parameter is uniform on a logarithmic scale (as discussed in Section 11.2), *ρ* is the natural parameter for any adaptation scheme.
- The nature of the inconsistency of scheme ARD2 (vide infra) becomes clearer when the update rule for the *ρ*_{g}s is analysed. As will be shown shortly in (15.15), the gradient of *E* with respect to *ρ* depends on all the weights exiting the input units, that is, both the weights feeding into the *S*-layer and those feeding into the *g*-layer. However, as illustrated above and discussed in a more general way in [7], pp. 340–342, these weights scale differently when the training data are subjected to a linear transformation. Consequently, the sign of the gradient in (15.15), and hence the network's 'assumption' about the significance of the different inputs, can change as the result of such a linear transformation. This is a striking inconsistency, since linear transformations of the data should lead to equivalent networks which differ only by the linear transformation of the weights.
- The method of simple weight decay, with *α*_{k} = 0.01 for all weight groups, was applied for regularization; see Section 12.1.2 for details.
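
The log-width parameterization described in the notes, combined with the sign-based adaptation mentioned in the abstract, can be illustrated with a minimal sketch. It is not the chapter's algorithm: the gradient of *E* with respect to *ρ* (equation (15.15)) is not reproduced here, so the gradient values, the step size, and the function names `manhattan_step` and `widths` are all hypothetical. The sketch also assumes plain descent on *E*, where a Manhattan-style update moves each *ρ*_{g} by a fixed amount against the sign of its gradient.

```python
import math


def manhattan_step(rho, grad, step=0.1):
    """Move each log-width by a fixed step against the sign of its gradient.

    Only the sign of each gradient entry is used, not its magnitude.
    `step` is an assumed, illustrative step size.
    """
    new_rho = []
    for r, g in zip(rho, grad):
        if g > 0:
            r -= step
        elif g < 0:
            r += step
        # g == 0 leaves the log-width unchanged
        new_rho.append(r)
    return new_rho


def widths(rho):
    """Recover the widths sigma = exp(rho); positive by construction."""
    return [math.exp(r) for r in rho]


# Hypothetical example: three weight groups with made-up gradient values.
rho = [0.0, 0.0, 0.0]
grad = [2.5, -0.3, 0.0]

rho = manhattan_step(rho, grad)
sigma = widths(rho)
print(rho)
print(sigma)
```

Because the update acts on *ρ* rather than on *σ* directly, a fixed additive step in *ρ* corresponds to a fixed multiplicative change in *σ*, which matches the remark that *ρ* is the natural parameter for a scale quantity.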

## Copyright information

© Springer-Verlag London Limited 1999