Abstract
Neural networks are widely used for nonlinear pattern recognition and regression. However, they are considered black boxes because of the lack of transparency of their internal workings and the lack of direct relevance of their structure to the problem being addressed, which makes it difficult to gain insight from them. Furthermore, the structure of a neural network requires optimisation, which remains a challenge. Many existing structure optimisation approaches require either extensive multi-stage pruning or subjective thresholds for pruning parameters. Knowledge of any internal consistency in the behavior of neurons could help develop simpler, more systematic and more efficient approaches to optimising network structure. This chapter addresses in detail the issue of internal consistency in relation to redundancy and robustness of the structure of three-layer feed-forward networks, which are widely used for nonlinear regression. It first investigates whether there is a recognizable consistency in neuron activation patterns under all conditions of network operation, such as noise and different initial weights. If such consistency exists, it points to a recognizable optimum network structure for the given data. The results show that such a pattern does exist, and that it is most clearly evident not in the hidden neuron activations themselves but in the hidden neuron inputs to the output neuron (i.e., the weighted hidden neuron activations). It is shown that when a network has more than the optimum number of hidden neurons, the redundant neurons form clearly distinguishable correlated patterns in their weighted outputs. This correlation structure is exploited to extract the required number of neurons using correlation-distance-based self-organising maps (SOMs) combined with Ward clustering, which optimally clusters the correlated weighted hidden neuron activity patterns without any user-defined criteria or thresholds, thus automatically optimising network structure in one step.
The number of Ward clusters on the SOM is the required optimum number of hidden neurons. The optimum network obtained with the SOM/Ward approach is compared with those obtained using two documented pruning methods, optimal brain damage and variance nullity measure, to show that the correlation approach provides equivalent results. The robustness of the optimum-structure network is also tested against perturbation of its weights, and confidence intervals for the weights are illustrated. Finally, the approach is tested on two practical problems: a breast cancer diagnostic system and river flow forecasting.
References
S. Samarasinghe, Neural Networks for Applied Sciences and Engineering-From Fundamentals to Complex Pattern Recognition (CRC Press, 2006)
C. Bishop, Neural Networks for Pattern Recognition (Clarendon Press, Oxford, UK, 1996)
S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd edn. (Prentice Hall Inc, New Jersey, USA, 1999)
R. Reed, Pruning algorithms-A survey. IEEE Trans. Neural Networks 4, 740–747 (1993)
Y. Le Cun, J.S. Denker, S.A. Solla, Optimal brain damage, in Advances in Neural Information Processing (2), ed. by D.S. Touretzky (1990), pp. 598–605
B. Hassibi, D.G. Stork, G.J. Wolff, Optimal brain surgeon and general network pruning. IEEE International Conference on Neural Networks, vol. 1, (San Francisco, 1992), pp. 293–298
B. Hassibi, D.G. Stork, Second-order derivatives for network pruning: Optimal brain surgeon, in Advances in Neural Information Processing Systems, vol. 5, ed. by C. Lee Giles, S.J. Hanson, J.D. Cowan, (1993), pp. 164–171
A.P. Engelbrecht, A new pruning heuristic based on variance analysis of sensitivity information. IEEE Trans. Neural Networks 12(6), 1386–1399 (2001)
K. Hagiwara, Regularization learning, early stopping and biased estimator. Neurocomputing 48, 937–955 (2002)
M. Hagiwara, Removal of hidden units and weights for backpropagation networks. Proc. Int. Joint Conf. Neural Networks 1, 351–354 (1993)
F. Aires, Neural network uncertainty assessment using Bayesian statistics with application to remote sensing: 1. Network weights. J. Geophys. Res. 109, D10303 (2004). doi:10.1029/2003JD004173
F. Aires, Neural network uncertainty assessment using Bayesian statistics with application to remote sensing: 2. Output Error. J. Geophys. Res. 109, D10304 (2004). doi:10.1029/2003JD004174
F. Aires, Neural network uncertainty assessment using Bayesian statistics with application to remote sensing: 3. Network Jacobians. J. Geophys. Res. 109, D10305 (2004). doi:10.1029/2003JD004175
K. Warne, G. Prasad, S. Rezvani, L. Maguire, Statistical computational intelligence techniques for inferential model development: A comparative evaluation and novel proposition for fusion. Eng. Appl. Artif. Intell. 17, 871–885 (2004)
I. Rivals, L. Personnaz, Construction of Confidence Intervals for neural networks based on least squares estimation. Neural Networks 13, 463–484 (2000)
E.J. Teoh, K.C. Tan, C. Xiang, Estimating the number of hidden neurons in a feed forward network using the singular value decomposition. IEEE Trans. Neural Networks 17(6) (2006)
C. Xiang, S.Q. Ding, T.H. Lee, Geometrical interpretation and architecture selection of MLP. IEEE Trans. Neural Networks 16(1) (2005)
P.A. Castillo, J. Carpio, J.J. Merelo, V. Rivas, G. Romero, A. Prieto, Evolving multilayer perceptrons. Neural Process. Lett. 12(2), 115–127 (2000)
X. Yao, Evolutionary artificial neural networks. Proc. IEEE 87(9), 1423–1447 (1999)
S. Samarasinghe, Optimum structure of feed forward neural networks by SOM clustering of neuron activations. Proceedings of the International Modelling and Simulation Congress (MODSIM) (2007)
Neural Networks for Mathematica (Wolfram Research, Inc., USA, 2002)
J. Sietsma, R.J.F. Dow, Creating artificial neural networks that generalize. Neural Networks 4(1), 67–77 (1991)
Machine learning framework for Mathematica (Uni Software Plus, 2002). www.unisoftwareplus.com
J.H. Ward Jr, Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244 (1963)
K. Hornik, M. Stinchcombe, H. White, Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 3, 551–560 (1990)
A.R. Gallant, H. White, On learning the derivative of an unknown mapping with multilayer feedforward networks. Neural Networks 5, 129–138 (1992)
A. Al-yousef, S. Samarasinghe, Ultrasound based computer aided diagnosis of breast cancer: evaluation of a new feature of mass central regularity degree. Proceedings of the International Modelling and Simulation Congress (MODSIM) (2011)
S. Samarasinghe, Hydrocomplexity: New Tools for Solving Wicked Water Problems (IAHS Publ. 338, 2010)
Appendix: Algorithm for Optimising Hidden Layer of MLP Based on SOM/Ward Clustering of Correlated Weighted Hidden Neuron Outputs
I. Train an MLP with a relatively large number of hidden neurons
1. For input vector $X$, the weighted input $u_j$ and output $y_j$ of hidden neuron $j$ are:
$$\begin{aligned} u_{j} & = a_{0j} + \sum\limits_{i = 1}^{n} {a_{ij} x_{i} } \\ y_{j} & = f\left( {u_{j} } \right) \\ \end{aligned}$$where $a_{0j}$ is the bias weight, $a_{ij}$ are the input-hidden neuron weights, and $f$ is the transfer function.
2. The net input $v_k$ and output $z_k$ of output neuron $k$ are:
$$\begin{aligned} v_{k} & = b_{0k} + \sum\limits_{j = 1}^{m} {b_{jk} y_{j} } \\ z_{k} & = f\left( {v_{k} } \right) \\ \end{aligned}$$where $b_{0k}$ is the bias weight and $b_{jk}$ are the hidden-output weights.
3. The mean square error (MSE) over the whole data set is:
$$MSE = \frac{1}{2N}\left[ {\sum\limits_{i = 1}^{N} {\left( {t_{i} - z_{i} } \right)^{2} } } \right]$$where $t$ is the target and $N$ is the sample size.
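Steps 1-3 can be sketched in NumPy as follows; the layer sizes, tanh hidden units and linear output neuron are illustrative assumptions, not prescriptions from the chapter:

```python
import numpy as np

def forward(X, A, B):
    """Forward pass of a 3-layer MLP.
    X: (N, n) inputs; A: (n+1, m) input-hidden weights, bias in row 0;
    B: (m+1, 1) hidden-output weights, bias in row 0."""
    U = A[0] + X @ A[1:]          # weighted inputs u_j, shape (N, m)
    Y = np.tanh(U)                # hidden outputs y_j = f(u_j)
    V = B[0] + Y @ B[1:]          # net input v_k to the output neuron
    Z = V                         # linear output neuron (illustrative choice)
    return Y, Z

def mse(T, Z):
    """MSE as defined above: sum of squared errors over 2N."""
    return np.sum((T - Z) ** 2) / (2 * len(T))

# toy data and randomly initialised weights, stand-ins for real training data
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
T = np.sin(X[:, :1])
A = rng.normal(scale=0.5, size=(3, 6))   # 2 inputs -> 6 hidden neurons
B = rng.normal(scale=0.5, size=(7, 1))   # 6 hidden -> 1 output
Y, Z = forward(X, A, B)
err = mse(T, Z)
```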
4. Weights are updated using a chosen least-squares error minimisation method, such as the Levenberg-Marquardt method:
$$w_{m} = w_{m - 1} - \varepsilon \,Rd_{m}$$where $d_m$ is the sum of the error gradients for weight $w$ in epoch $m$, $R$ is the inverse of the curvature, and $\varepsilon$ is the learning rate.
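A minimal sketch of this kind of update, with $R$ taken as the damped inverse Gauss-Newton curvature $(J^{T}J + \lambda I)^{-1}$ and $d_m = J^{T}e$; the one-hidden-neuron model, finite-difference Jacobian and damping value are illustrative assumptions:

```python
import numpy as np

def lm_step(w, x, t, model, lam=1e-2, h=1e-6):
    """One Levenberg-Marquardt-style step: w <- w - (J'J + lam*I)^-1 J'e."""
    e = model(w, x) - t                        # residuals
    J = np.empty((len(x), len(w)))             # finite-difference Jacobian
    for j in range(len(w)):
        wp = w.copy()
        wp[j] += h
        J[:, j] = (model(wp, x) - model(w, x)) / h
    R = np.linalg.inv(J.T @ J + lam * np.eye(len(w)))  # damped inverse curvature
    return w - R @ (J.T @ e)

# hypothetical single-hidden-neuron model y = w1 * tanh(w0 * x)
model = lambda w, x: w[1] * np.tanh(w[0] * x)
x = np.linspace(-1, 1, 50)
t = 0.8 * np.tanh(1.5 * x)                     # exactly representable target
w = np.array([0.5, 0.3])
for _ in range(20):
    w = lm_step(w, x, t, model)
final_mse = np.mean((model(w, x) - t) ** 2)
```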
5. Repeat steps 1 to 4 until the minimum MSE is reached on the training, calibration (testing) and validation data sets.
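The stopping rule in step 5 can be sketched as an early-stopping loop; the plain gradient-descent update and one-neuron toy model below stand in for the Levenberg-Marquardt training described above:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 200)
t = np.sin(2 * x) + rng.normal(scale=0.05, size=200)
# training / calibration split (a validation set would be held out the same way)
xtr, ttr, xca, tca = x[:150], t[:150], x[150:], t[150:]

def pred(w, x):
    """Hypothetical one-hidden-neuron model y = w1 * tanh(w0 * x)."""
    return w[1] * np.tanh(w[0] * x)

def grad(w, x, t):
    """Gradient of mean squared error / 2 w.r.t. (w0, w1)."""
    e = pred(w, x) - t
    h = np.tanh(w[0] * x)
    return np.array([np.mean(e * w[1] * (1 - h ** 2) * x), np.mean(e * h)])

w = np.array([0.5, 0.5])
best, best_w, stalls = np.inf, w.copy(), 0
for epoch in range(500):
    w -= 0.5 * grad(w, xtr, ttr)                     # training update
    cal = np.mean((pred(w, xca) - tca) ** 2) / 2     # calibration MSE
    if cal < best - 1e-6:
        best, best_w, stalls = cal, w.copy(), 0
    else:
        stalls += 1
        if stalls >= 20:      # stop once calibration error stops improving
            break
w = best_w
```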
II. SOM clustering of weighted hidden neuron outputs
Inputs to SOM
An input vector $X_j$ into the SOM is:
$$X_{j} = b_{j} \left[ {y_{j1} ,y_{j2} , \ldots ,y_{jn} } \right]$$where $y_j$ is the output of hidden neuron $j$ and $b_j$ is its weight to the output neuron in the MLP. The length $n$ of the vector $X_j$ is equal to the number of samples in the original dataset.
Normalise $X_j$ to unit length.
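The construction and normalisation of these SOM input vectors can be sketched as follows; the hidden-output matrix and hidden-to-output weights are random stand-ins for those of a trained MLP:

```python
import numpy as np

rng = np.random.default_rng(3)
N, m = 200, 8                              # samples, hidden neurons
Y = np.tanh(rng.normal(size=(N, m)))       # hidden outputs y_j (stand-in)
b = rng.normal(size=m)                     # hidden-to-output weights b_j

# one SOM input vector per hidden neuron: X_j = b_j * [y_j1, ..., y_jn]
X = (Y * b).T                              # shape (m, N): m vectors of length N
# normalise each X_j to unit length, so correlation distance matches
# Euclidean distance during SOM training
X = X / np.linalg.norm(X, axis=1, keepdims=True)
```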
SOM training
1. Project the weighted outputs of the hidden neurons onto a self-organising map:
$$u_{j} = \sum\limits_{i = 1}^{n} {w_{ij} x_{i} }$$where $u_j$ is the output of SOM neuron $j$ and $w_{ij}$ is its weight for input component $x_i$.
2. Winner selection: select the winner neuron based on the minimum correlation distance between an input vector and the SOM neuron weight vectors (equivalent to the Euclidean distance for normalised input vectors):
$$d_{j} = \left\| {{\text{x}} - {\text{w}}_{j} } \right\| = \sqrt {\sum\limits_{i = 1}^{n} {\left( {x_{i} - w_{ij} } \right)^{2} } }$$
3. Update the weights of the winner and its neighbours at iteration $t$:
Select a neighbourhood function $NS(d, t)$ (such as a Gaussian) and a learning rate function $\beta(t)$ (such as exponential or linear), where $d$ is the distance from the winner to a neighbour neuron and $t$ is the iteration:
$${\text{w}}_{\text{j}} \left( t \right) = {\text{w}}_{\text{j}} \left( {t - 1} \right) + \beta \left( t \right)NS\left( {d,t} \right)\left[ {{\text{x}}\left( t \right) - {\text{w}}_{\text{j}} \left( {t - 1} \right)} \right]$$
4. Repeat the process until the mean distance $D$ between the weights and the inputs is a minimum:
$$D = \sum\limits_{i = 1}^{k} {\sum\limits_{{n \in c_{i} }} {\left( {{\text{x}}_{n} - {\text{w}}_{i} } \right)^{2} } }$$where $k$ is the number of SOM neurons and $c_i$ is the cluster of inputs represented by neuron $i$.
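The SOM training loop in steps 1-4 can be sketched as follows; the map size, Gaussian neighbourhood width, learning-rate schedule and iteration count are all illustrative assumptions:

```python
import numpy as np

def train_som(X, rows=4, cols=4, iters=2000, beta0=0.5, sigma0=2.0, seed=0):
    """Train a small 2-D SOM on unit-length input rows of X."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(rows * cols, X.shape[1]))
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for t in range(iters):
        x = X[rng.integers(len(X))]
        # winner: minimum Euclidean distance (= correlation distance for
        # unit-length vectors, as noted in step 2)
        win = np.argmin(np.sum((W - x) ** 2, axis=1))
        # Gaussian neighbourhood NS(d, t) with shrinking width, and
        # exponentially decaying learning rate beta(t)
        d2 = np.sum((grid - grid[win]) ** 2, axis=1)
        sigma = sigma0 * np.exp(-t / iters)
        ns = np.exp(-d2 / (2 * sigma ** 2))
        beta = beta0 * np.exp(-t / iters)
        W += beta * ns[:, None] * (x - W)        # step-3 update rule
    return W

# stand-in inputs: 40 unit-length vectors of length 100
rng = np.random.default_rng(4)
X = rng.normal(size=(40, 100))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W = train_som(X)
# quantisation error: mean squared distance from each input to its winner
qe = np.mean([np.min(np.sum((W - x) ** 2, axis=1)) for x in X])
```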
III. Clustering of SOM neurons
The Ward method minimizes the increase in the within-group sum of squares that results from joining two candidate clusters. The within-group sum of squares is the sum of squared distances between all objects in a cluster and its centroid; the two clusters whose merger produces the smallest increase are merged at each step. This distance measure is called the Ward distance ($d_{ward}$) and is expressed as:
$$d_{ward} = \frac{{n_{r} n_{s} }}{{n_{r} + n_{s} }}\left\| {{\text{x}}_{r} - {\text{x}}_{s} } \right\|^{2}$$where $x_r$ and $x_s$ are the centres of gravity of the two clusters, and $n_r$ and $n_s$ are the number of data points in them.
The centre of gravity $x_{r(new)}$ of the two merged clusters is calculated as:
$${\text{x}}_{r(new)} = \frac{{n_{r} {\text{x}}_{r} + n_{s} {\text{x}}_{s} }}{{n_{r} + n_{s} }}$$
The likelihood of various numbers of clusters is determined by a WardIndex computed from $d_t$, the distance between the centres of the two clusters merged at the current step; $d_{t-1}$ and $d_{t-2}$, the corresponding distances in the previous two steps; and $NC$, the number of clusters remaining.
The number of clusters with the highest WardIndex is selected as the optimum.
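The SOM-neuron clustering in step III can be sketched with SciPy's Ward linkage; picking the cluster count just before the largest jump in merge distance is a simple stand-in for the WardIndex criterion described above, and the codebook here is a synthetic stand-in for trained SOM weights:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
# stand-in SOM codebook: 18 neuron weight vectors around 3 separated centres
centres = np.eye(3, 10) * 8.0
W = np.vstack([c + rng.normal(scale=0.3, size=(6, 10)) for c in centres])

# each Ward merge minimises the increase in within-group sum of squares
Z = linkage(W, method="ward")
# number of clusters just before the largest jump in merge distance
gaps = np.diff(Z[:, 2])
n_clusters = len(W) - (np.argmax(gaps) + 1)
labels = fcluster(Z, t=n_clusters, criterion="maxclust")
```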
IV. Optimum number of hidden neurons in MLP
The optimum number of hidden neurons in the original MLP is equal to this optimum number of clusters on the SOM.
Retrain the MLP with this optimum number of hidden neurons.
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this chapter
Samarasinghe, S. (2016). Order in the Black Box: Consistency and Robustness of Hidden Neuron Activation of Feed Forward Neural Networks and Its Use in Efficient Optimization of Network Structure. In: Shanmuganathan, S., Samarasinghe, S. (eds) Artificial Neural Network Modelling. Studies in Computational Intelligence, vol 628. Springer, Cham. https://doi.org/10.1007/978-3-319-28495-8_2
Print ISBN: 978-3-319-28493-4
Online ISBN: 978-3-319-28495-8