Automatic Relevance Determination (ARD)
A scheme for the systematic adaptation of the random-parameter distribution widths is introduced. Weights exiting the same input node are combined into a weight group, and the distribution widths of the weight groups are adjusted during training by a method similar to Manhattan updating. A practical algorithm is derived, and an empirical demonstration shows that irrelevant inputs are detected and effectively switched off. The whole scheme was inspired by and is akin to Neal’s and MacKay’s automatic relevance determination. It will therefore be referred to by the same name.
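The adaptation step described above can be illustrated with a minimal sketch. It assumes that each weight group g carries a log-width ρ_g with σ_g = exp(ρ_g), and that the gradient of the error with respect to each ρ_g has already been computed elsewhere. The function names, the step size, and the example gradient values are hypothetical and are not taken from the chapter; only the sign-based (Manhattan-style) update is shown, not the chapter's actual update rule.

```python
import numpy as np

def manhattan_update(rho, grad_rho, step=0.05):
    """Move each log-width by a fixed step against the sign of its gradient.

    Only the sign of the gradient is used, never its magnitude
    (Manhattan-style updating). `step` is an assumed, hypothetical value.
    """
    return rho - step * np.sign(grad_rho)

def group_widths(rho):
    """Map log-widths rho_g to distribution widths sigma_g = exp(rho_g)."""
    return np.exp(rho)

# One log-width per input node (weight group); three inputs assumed here.
rho = np.zeros(3)
# Hypothetical gradients of the error with respect to rho_g.
grad = np.array([0.7, -0.2, 0.0])

rho_new = manhattan_update(rho, grad)
sigma_new = group_widths(rho_new)
```

In this sketch a group whose width keeps shrinking over many updates ends up with weights confined close to zero, which is one way of reading the statement that an irrelevant input is effectively switched off.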
- The term hyperparameter will be borrowed since the randg are parameters of a prior probability distribution, and are closely related to the hyperparameters α_g in MacKay’s work.
- By defining σ as the exponential of ρ, its positivity is always ensured. The other reason for introducing ρ is that σ is a scale parameter. Since a non-informative prior for a scale parameter is uniform on a logarithmic scale (as discussed in Section 11.2), ρ is the natural parameter for any adaptation scheme.
- The nature of the inconsistency of scheme ARD2 (vide infra) becomes clearer when the update rule for the ρ_g is analysed. As will be shown shortly in (15.15), the gradient of E with respect to ρ depends on all the weights exiting the input units, that is, both the weights feeding into the S-layer and those feeding into the g-layer. However, as illustrated above and discussed in a more general way in , pp. 340–342, these weights scale differently when the training data are subjected to a linear transformation. Consequently, the sign of the gradient in (15.15), and hence the network’s ‘assumption’ about the significance of the different inputs, can change as the result of such a linear transformation. This is a striking inconsistency, since linear transformations of the data should lead to equivalent networks which differ only by the linear transformation of the weights.
- The method of simple weight decay, with α_k = 0.01 for all weight groups, was applied for regularization; see Section 12.1.2 for details.
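The note on defining σ as the exponential of ρ can be made explicit with a short worked derivation. The group index g on σ and ρ is an assumption made here for readability; the relations themselves follow directly from the exponential parametrisation and the standard change-of-variables rule for densities.

```latex
\sigma_g = e^{\rho_g} > 0,
\qquad
\frac{\partial E}{\partial \rho_g}
  = \frac{\partial E}{\partial \sigma_g}\,\frac{\partial \sigma_g}{\partial \rho_g}
  = \sigma_g\,\frac{\partial E}{\partial \sigma_g},
\qquad
p(\sigma_g) \propto \frac{1}{\sigma_g}
\;\Longleftrightarrow\;
p(\rho_g) = p(\sigma_g)\left|\frac{d\sigma_g}{d\rho_g}\right| = \text{const.}
```

Because σ_g is strictly positive, the gradients with respect to ρ_g and σ_g always share the same sign, so an update rule that uses only the sign of the gradient moves the width in the same direction under either parametrisation.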