# Interval Adjoint Significance Analysis for Neural Networks


## Abstract

The architecture of a neural network strongly affects its computational complexity and memory footprint. In this regard, a robust pruning method based on interval adjoint significance analysis is presented in this paper to prune irrelevant and redundant nodes from a neural network. The significance of a node is defined as the product of the node’s interval width and the absolute maximum of the first-order derivative of that node’s interval. Based on the significance of nodes, one can decide how much to prune from each layer. We show that the proposed method works effectively on hidden and input layers by experimenting on well-known and complex machine learning datasets. In the proposed method, a node is removed based on its significance and the bias is updated for the remaining nodes.

## Keywords

Significance analysis · Sensitivity analysis · Neural network pruning · Interval adjoints

## 1 Introduction

Neural networks and deep belief networks are powerful machine learning tools for classification tasks. Many things must be considered when constructing an effective neural network architecture, e.g., learning rate, optimization method, and regularization. One of the most important hyper-parameters is network size, and it is hard to guess the optimal size of a network. Large networks are good at memorization and train quickly, but they tend to generalize poorly and can over-fit. Smaller networks can improve generalization and reduce the computational cost of classification, but they can under-fit. The goal is a neural network architecture that avoids both problems [1].

Researchers have proposed different techniques, such as brute-force [2], growing [3] and pruning methods [4]. Of these techniques, pruning yields an effective compressed neural network architecture without significantly hurting network accuracy. This technique starts with a well-trained network. Assuming the network is oversized, it tries to remove irrelevant or insignificant parameters from the network. These parameters can be the network’s weights, inputs, or hidden units.

Over time, multiple pruning methods have been proposed (see the detailed surveys [1, 2, 5]). Among these methods, sensitivity-based analysis is the most widely known [6, 7, 8, 9, 10]. It measures the impact of neural network parameters on the output. Our proposed method also uses the concept of sensitivity to define the significance of network parameters. Sensitivities of an objective function with respect to the weights of the network are used to optimize the network’s weights, while sensitivities of the output unit with respect to the input and hidden units are used to find the significance of the network’s input and hidden units.

This paper presents a method for finding the sensitivities of the network’s output with respect to the network’s input and hidden units in a more robust and efficient way by using interval adjoints. Input and hidden unit values and their impact on the output are used to define the significance of the input and hidden units. The significance analysis method defined in this paper takes into account all the information of the network units, and the information stored during significance analysis is used to update the biases of the remaining nodes in the network.

The rest of the paper is organized as follows. Section 2 briefly describes algorithmic differentiation (AD) for interval data with examples. Section 3 presents our significance analysis method for pruning. Experimental results are given in Sect. 4. The conclusion is given in Sect. 5.

## 2 Interval Adjoint Algorithmic Differentiation

This section gives a brief introduction to AD [11, 12] and its modes, and describes differentiation with one of these modes, commonly known as the adjoint mode of AD. Intervals are then introduced for interval adjoint algorithmic differentiation.

### 2.1 Basics of AD

Let *F* be a differentiable implementation of a mathematical function \(F : \mathbb {R}^{n+l} \longrightarrow \mathbb {R}^{m} : \mathbf {y} = F(\mathbf {x}, \mathbf {p})\), computing an output vector \(\mathbf{y} \in \mathbb {R}^m\) from inputs \(\mathbf{x} \in \mathbb {R}^n\) and constant input parameters \(\mathbf{p} \in \mathbb {R}^l\). Differentiating *F* with respect to \(\mathbf{x}\) yields the Jacobian matrix \(\nabla _\mathbf{x}F \in \mathbb {R}^{m \times n}\) of *F*.

This mathematical function *F* can be implemented in a higher-level programming language so that AD can be applied to the code. AD works on the principle of the chain rule. It can be implemented using source transformation [13] or operator overloading [14] to change the domain of the variables involved in the computation. It calculates the derivatives and partials along with each output (primal value) in a time similar to one evaluation of the function. Several tools implement AD, e.g. dco/c++ [15, 16]. dco/c++ implements AD with the help of operator overloading and has been successfully used in many applications [17, 18].

### 2.2 Modes of AD

There are many modes of AD [11, 12], but two of them are most widely used: forward mode AD (also called tangent-linear mode) and reverse mode AD (also called adjoint mode). Because we are interested in adjoints in this work, we describe reverse mode AD briefly.

Reverse mode AD has two phases: a forward pass and a backward pass. In the forward pass, the function code is run forward, yielding primal values for all intermediate and output variables and storing all the information needed during the backward pass. During the backward pass, the adjoints of the outputs (for an output, the adjoint is 1) are propagated backwards through the computation of the original model to the adjoints of the inputs.

**Example (AD Reverse Mode).** Below is an example of adjoint mode AD evaluation on function \(f(\mathbf{{x}}) = \sin (x _0 \,\cdot \, x _1)\).
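The two passes can be sketched by hand for this function. The following is a minimal illustration only, not the operator-overloading implementation used in the paper; the function name `f_and_gradient` and the evaluation point are assumptions chosen for the example.

```python
import math

def f_and_gradient(x0, x1):
    """Hand-written reverse mode AD for f(x) = sin(x0 * x1) (illustrative sketch)."""
    # Forward pass: compute and keep the intermediate (primal) values.
    v = x0 * x1
    y = math.sin(v)

    # Backward pass: seed the output adjoint with 1 and propagate to the inputs.
    y_bar = 1.0
    v_bar = math.cos(v) * y_bar   # dy/dv = cos(v)
    x0_bar = x1 * v_bar           # dv/dx0 = x1
    x1_bar = x0 * v_bar           # dv/dx1 = x0
    return y, (x0_bar, x1_bar)

# Hypothetical evaluation point, chosen only for illustration.
y, (g0, g1) = f_and_gradient(1.0, 0.5)
print(y, g0, g1)
```

Both partial derivatives are obtained from a single forward and a single backward sweep, which is why the adjoint mode is attractive when there are many inputs and few outputs.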

### 2.3 Interval Adjoints

In the following, lowercase letters (e.g., *a*, *b*, *c*, ...) represent real numbers, uppercase letters (e.g., *A*, *B*, *C*, ...) represent interval data, and an interval is written as \(X = [x^{l}, x^{u}]\), where *l* and *u* denote the lower and upper limit of the interval, respectively.

Interval Arithmetic (IA) evaluates a function *f*[*X*] over a given range of its domain in such a way that it gives a guaranteed enclosure \(f[X] \supseteq \{f(x)|x \in [X]\}\) that contains all possible values of *f*(*x*) for \(x \in [X]\). Similarly, interval evaluation yields enclosures \([V_{i}]\) for all intermediate variables \(V_{i}\).

The adjoint mode of AD can also be applied to interval functions for differentiation purposes [19]. The impact of individual input and intermediate variables on the output of an interval-valued function can easily be evaluated by using the adjoint mode of AD over interval functions. AD not only computes the primal values of intermediate and output variables, it also computes their derivatives with the help of the chain rule. The first-order derivatives (\(\frac{\partial {Y}}{\partial {X_{i}}}\), \(\frac{\partial {Y}}{\partial {V_{i}}}\)) of output *Y* with respect to all inputs \(X_{i}\) and intermediate variables \(V_{i}\) can be computed with the adjoint mode of AD in a single evaluation of function *f*. In the same way, IA and AD can be combined to find the interval-valued partial derivatives (\(\nabla _{[V_{i}]} [Y]\), \(\nabla _{[X_{i}]} [Y]\)) that contain all the possible derivatives of output *Y* with respect to intermediate variables \(V_{i}\) and input variables \(X_{i}\) over the given input interval *X*.

**Example (AD Reverse Mode with Interval Data).** Below is an example of adjoint mode AD evaluation on interval function \(f(\mathbf{{X}}) = \sin (X _1 \,\cdot \, X _2)\) for calculation of interval adjoints. Let \(X = \{X_1, X_2\}\), where \(X_1 = [0.5, 1.5], X_2 = [ -0.5, 0.5]\).
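The interval evaluation and its adjoints for these inputs can be sketched in a few lines. This is a minimal sketch under stated assumptions: intervals are plain `(lower, upper)` tuples, the helper names `imul`, `isin`, `icos` and `interval_adjoints` are hypothetical, the sine and cosine helpers are only valid on \([-\pi/2, \pi/2]\), and floating-point rounding is ignored, so the enclosures are not rigorously guaranteed the way a proper interval library with outward rounding would make them.

```python
import math

def imul(a, b):
    """Interval product: min and max over the four endpoint products."""
    p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(p), max(p))

def isin(a):
    """Interval sine, restricted to [-pi/2, pi/2] where sin is increasing."""
    assert -math.pi / 2 <= a[0] <= a[1] <= math.pi / 2
    return (math.sin(a[0]), math.sin(a[1]))

def icos(a):
    """Interval cosine, restricted to [-pi/2, pi/2]; the maximum is 1 if 0 is inside."""
    assert -math.pi / 2 <= a[0] <= a[1] <= math.pi / 2
    hi = 1.0 if a[0] <= 0.0 <= a[1] else max(math.cos(a[0]), math.cos(a[1]))
    lo = min(math.cos(a[0]), math.cos(a[1]))
    return (lo, hi)

def interval_adjoints(X1, X2):
    # Forward pass with interval arithmetic.
    V = imul(X1, X2)
    Y = isin(V)
    # Backward pass: seed the output adjoint with the degenerate interval [1, 1].
    Y_bar = (1.0, 1.0)
    V_bar = imul(icos(V), Y_bar)   # dY/dV = cos(V)
    X1_bar = imul(X2, V_bar)       # dV/dX1 = X2
    X2_bar = imul(X1, V_bar)       # dV/dX2 = X1
    return Y, V, V_bar, X1_bar, X2_bar

Y, V, V_bar, X1_bar, X2_bar = interval_adjoints((0.5, 1.5), (-0.5, 0.5))
print("V =", V, "Y =", Y)
print("dY/dX1 in", X1_bar, "dY/dX2 in", X2_bar)
```

For the inputs above, the intermediate product is \([V] = [-0.75, 0.75]\), and the adjoint intervals enclose every derivative value that can occur for points inside \(X_1\) and \(X_2\).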


## 3 Significance Analysis

Sensitivity-based methods are the most useful for defining the significance of network parameters. In [7, 10], researchers used the network’s output to find the sensitivity of network parameters. Sensitivity is defined as the degree to which an output responds to a deviation in its inputs [7]. The deviation in output \(\varDelta {y}\) due to a deviation in the inputs is the difference between the deviated and non-deviated outputs, \(f((X+\varDelta {X})*w)-f(X*w)\). Each input \(X_i\) is treated as an interval [0, 1], so that the sensitivity is found over the whole input range rather than only at fixed points.

With the sensitivity defined above, the significance of a node is measured as the product of its sensitivity and the sum of the absolute values of its outgoing weights. This definition of significance from [7] has a few limitations. First, it can only be applied to hidden layer nodes, not to the network input layer. Second, because it works layer by layer, one layer has to be pruned before moving on to the next.

To find the significance of both input and hidden nodes of a network, another method, proposed in [10], computes sensitivities as partial derivatives of the network outputs with respect to input and hidden nodes. Although this method computes the sensitivities of the network’s parameters well, it has a high computational cost, because it computes the partial derivative at a given parameter for every training pattern and then averages the sensitivity over all training patterns. We propose a new sensitivity method in Sect. 3.1, based on interval adjoints, to tackle the shortcomings of these earlier sensitivity-based pruning methods.

### 3.1 Interval Adjoint Significance Analysis

*Y* and different intermediate variables *V*, may increase or decrease the influence of variable \(V_{j}\). Therefore, it is necessary to find the influence of that variable \(V_{j}\) on output *Y*. The absolute maximum of the first-order partial derivative \(\max |\nabla _{[V_{j}]} [Y]|\) of variable \(V_{j}\) gives us this influence of variable \(V_{j}\) over output *Y*.
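As stated in the abstract, the significance of a node is the product of its interval width and the absolute maximum of its interval adjoint. A minimal sketch of that product follows; the function name `significance`, the tuple representation of intervals, and the numeric values are assumptions made for illustration and are not results from the paper.

```python
def significance(value_interval, adjoint_interval):
    """Interval width times the absolute maximum of the interval adjoint."""
    lower, upper = value_interval
    width = upper - lower
    # The largest absolute value in an interval is attained at one of its bounds.
    influence = max(abs(adjoint_interval[0]), abs(adjoint_interval[1]))
    return width * influence

# Hypothetical nodes: (value interval, adjoint interval of the output w.r.t. the node).
nodes = {
    "v1": ((0.5, 1.5), (-0.5, 0.5)),
    "v2": ((-0.5, 0.5), (0.25, 1.5)),
}
scores = {name: significance(v, a) for name, (v, a) in nodes.items()}
print(scores)
```

A node with a wide interval but a near-zero adjoint, or a large adjoint but a nearly constant value, receives a low score under this measure.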

### 3.2 Selection of Significant Nodes

A single interval input vector is used for finding the significance. Let us assume that *m* training patterns \(x_{k} = \{{x_{k1}, . . ., x_{kn}}\}\), \(k = 1,2,3,...,m\) are used to train the neural network. These *m* training patterns are used to generate the interval input vector for significance analysis by finding the maximum and minimum value of each input (\(e.g.~x_{1}\)) over all training patterns \((x_{11},...,x_{m1})\). These maximum and minimum values are used to construct the interval input vector \(X = \{[min(x_{k1}), max(x_{k1})],..., [min(x_{kn}), max(x_{kn})]\}\). Because a scalar is a degenerate interval whose upper and lower bounds are the same, the trained weight and bias vectors of the network can be converted to interval vectors whose upper and lower bounds coincide.
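The construction of the interval input vector can be sketched as follows. This is an illustration under assumptions: training patterns are given as a list of equal-length rows, intervals are `(lower, upper)` tuples, and the names `interval_input_vector` and `degenerate` are hypothetical.

```python
def interval_input_vector(patterns):
    """Per-input [min, max] interval over all training patterns."""
    columns = zip(*patterns)   # one column per input feature
    return [(min(col), max(col)) for col in columns]

def degenerate(values):
    """Turn scalars (e.g. trained weights or biases) into intervals with equal bounds."""
    return [(v, v) for v in values]

# Three hypothetical training patterns with two inputs each.
patterns = [[0.0, 0.2],
            [1.0, 0.1],
            [0.5, 0.9]]
X = interval_input_vector(patterns)
B = degenerate([0.3, -0.7])
print(X)
print(B)
```

Only one interval evaluation of the network is then needed, instead of one evaluation per training pattern.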

### 3.3 Removal of Insignificant Nodes

After the selection of significant nodes in the network, it is necessary to preserve the information of the insignificant nodes before removing them from the network; otherwise, we would change the inputs to the next layer’s activations. Insignificant nodes have little impact on the network, and the weights associated with their incoming and outgoing connections are mostly redundant and have very low values. Significance analysis together with interval evaluation not only gives us the influence of nodes, it also gives us a guaranteed enclosure that contains all the possible values of the insignificant nodes and their outgoing connections. We can store the information of the previous layer’s insignificant nodes in the significant nodes of the next layer by calculating the midpoints \( (v_{j} = \frac{(v_j^l + v_j^u)}{2} )\) of all incoming connections from the previous layer’s insignificant nodes to the significant nodes of the next layer. We can sum up all these midpoints and add them to the bias of the significant nodes. This process is illustrated in Fig. 3.
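The bias update can be sketched as below. This is one possible reading of the procedure, not the authors' implementation: it assumes that the interval of a connection is the scalar weight times the interval value of the removed node, that `weights[i][j]` connects previous-layer node `j` to next-layer node `i`, and that the helper name `fold_into_bias` and all numbers are hypothetical.

```python
def fold_into_bias(weights, bias, node_intervals, removed):
    """Remove previous-layer nodes and add the midpoints of their
    connection intervals to the biases of the next-layer nodes."""
    removed = sorted(set(removed))
    kept = [j for j in range(len(node_intervals)) if j not in removed]
    new_weights, new_bias = [], []
    for row, b in zip(weights, bias):
        shift = 0.0
        for j in removed:
            lower, upper = node_intervals[j]
            # Bounds of the connection interval w * [v_j]; the midpoint does not
            # depend on which of the two products is the lower bound.
            p, q = row[j] * lower, row[j] * upper
            shift += (p + q) / 2.0
        new_bias.append(b + shift)
        new_weights.append([row[j] for j in kept])
    return new_weights, new_bias

# One next-layer node fed by three previous-layer nodes; nodes 1 and 2 are removed.
weights = [[1.0, 2.0, -1.0]]
bias = [0.5]
node_intervals = [(0.0, 1.0), (0.2, 0.4), (0.0, 0.1)]
new_weights, new_bias = fold_into_bias(weights, bias, node_intervals, removed=[1, 2])
print(new_weights, new_bias)
```

Here the removed nodes contribute midpoints of 0.6 and -0.05, so the bias moves from 0.5 to about 1.05 while the weight matrix shrinks to the single remaining column.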

## 4 Experimental Results

There are many parameters to optimize in the layers of fully connected neural networks and in the fully connected layers of most large-scale convolutional neural networks [24, 25]. In this work, we analyze the performance of our method on fully connected networks, with the aim of reducing the number of parameters and obtaining a compressed network that still achieves the same accuracy. For this purpose, we chose four datasets: MNIST [26], MNIST_ROT [27], Fashion-MNIST [29] and CIFAR-10 [28].

Datasets and networks used for experiments

Dataset | Architecture |
---|---|
MNIST | 784-500-500-10 |
MNIST_ROT | 784-500-500-10 |
Fashion-MNIST | 784-500-500-10 |
CIFAR-10 | 3072-500-500-10 |

### 4.1 Experiments on MNIST

Significance analysis works well on the CIFAR-10 dataset too, and we can remove a few hundred input features from the network, as shown in Fig. 8. Figures 5, 6 and 7 show plots of node removal from each hidden layer and from all hidden layers in the network after the significance analysis, respectively.

### 4.2 Experiments on CIFAR-10

For the last 11 years, CIFAR-10 has been the focus of intense research, which makes it an excellent test case for our method. It has been widely used as a test case for many computer vision methods. After MNIST, it has been ranked the second most referenced dataset [32]. CIFAR-10 is still the subject of current research [33, 34, 35] because of its difficulty. All of its 50k training samples were used to train the network and its 10k test samples were used to test the network.

As with MNIST, we can see the error increase when we apply significance analysis to all hidden layers. The error plot for node removal from the first hidden layer in Fig. 5 and the error plot for node removal from all hidden layers in Fig. 7 look quite similar. This is because the second hidden layer does not contribute much to the overall performance of the network.

### 4.3 Experiments on MNIST_ROT and Fashion-MNIST

### 4.4 Network Performance Before and After Retraining

Training and test error before and after retraining for different percentages of removed insignificant nodes. Initially, all the neural nets have 1000 hidden nodes. Columns give the percentage of neurons removed from the hidden layers.

Dataset | Error | 90% | 80% | 75% | 50% | 25% | 0% |
---|---|---|---|---|---|---|---|
MNIST | Training error | 0.26 | 0.03 | 0.01 | 0.01 | 0.01 | 0.01 |
MNIST | Training error (after retraining) | 0.002 | 0.001 | 0.0009 | 0.01 | 0.01 | |
MNIST | Test error | 0.26 | 0.03 | 0.02 | 0.02 | 0.02 | 0.02 |
MNIST | Test error (after retraining) | 0.02 | 0.02 | 0.02 | 0.01 | 0.01 | |
MNIST_ROT | Training error | 0.53 | 0.12 | 0.11 | 0.10 | 0.09 | 0.09 |
MNIST_ROT | Training error (after retraining) | 0.02 | 0.0001 | 0.0005 | 0.0004 | 0.003 | |
MNIST_ROT | Test error | 0.53 | 0.15 | 0.14 | 0.13 | 0.12 | 0.12 |
MNIST_ROT | Test error (after retraining) | 0.13 | 0.10 | 0.10 | 0.10 | 0.10 | |
Fashion-MNIST | Training error | 0.65 | 0.13 | 0.10 | 0.09 | 0.09 | 0.09 |
Fashion-MNIST | Training error (after retraining) | 0.05 | 0.04 | 0.03 | 0.03 | 0.03 | |
Fashion-MNIST | Test error | 0.65 | 0.16 | 0.13 | 0.12 | 0.12 | 0.12 |
Fashion-MNIST | Test error (after retraining) | 0.11 | 0.12 | 0.11 | 0.12 | 0.11 | |
CIFAR-10 | Training error | 0.60 | 0.49 | 0.47 | 0.42 | 0.42 | 0.42 |
CIFAR-10 | Training error (after retraining) | 0.46 | 0.42 | 0.43 | 0.42 | 0.42 | |
CIFAR-10 | Test error | 0.64 | 0.54 | 0.52 | 0.48 | 0.48 | 0.48 |
CIFAR-10 | Test error (after retraining) | 0.50 | 0.49 | 0.49 | 0.48 | 0.48 | |

Training and test errors increase when we remove 90% of the hidden nodes from the network, which is expected because we are taking away too much information from the original network. However, we can improve the performance of the network by retraining it, starting from the remaining old weight vector and the updated bias vector of the significant nodes. It is better to use the remaining network connections for retraining than to initialize the values again. After retraining, the error rate decreased significantly for 90% node removal from the original network, and in some cases it fell below the original error rate measured before significance analysis.

## 5 Future Work and Conclusion

A new method for finding and removing redundant and irrelevant nodes from a neural network using interval adjoints is proposed in this paper. Our method finds the significance of hidden as well as input nodes. The significance depends on two factors: the impact of a node on the output and the width of the node’s interval. The use of interval data and the computation of sensitivities with interval adjoints make our method more robust than several existing methods. The results presented in this paper indicate that the significance analysis correctly identifies irrelevant input and hidden nodes in a network, and that it also provides enough information to update the bias of the relevant nodes so that the performance of the network is not compromised by removing irrelevant nodes.

Our future work will be aimed at applying interval adjoint significance analysis to the convolutional and fully connected layers of convolutional neural networks. Furthermore, we will investigate applying significance analysis during the training of a network to speed up the training process by eliminating the less significant nodes from the network.


## References

- 1. Augasta, M.G., Kathirvalavakumar, T.: Pruning algorithms of neural networks - a comparative study. Cent. Eur. J. Comp. Sci. **3**(3), 105–115 (2013). https://doi.org/10.2478/s13537-013-0109-x
- 2. Reed, R.: Pruning algorithms-a survey. IEEE Trans. Neural Networks **4**(5), 740–747 (1993)
- 3. Fahlman, S.E., Lebiere, C.: The cascade-correlation learning architecture. In: Advances in Neural Information Processing Systems, pp. 524–532 (1990)
- 4. Castellano, G., Fanelli, A.M., Pelillo, M.: An iterative pruning algorithm for feedforward neural networks. IEEE Trans. Neural Networks **8**(3), 519–531 (1997)
- 5. Cheng, Y., Wang, D., Zhou, P., Zhang, T.: A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282 (2017)
- 6. Xu, J., Ho, D.W.: A new training and pruning algorithm based on node dependence and Jacobian rank deficiency. Neurocomputing **70**(1–3), 544–558 (2006)
- 7. Zeng, X., Yeung, D.S.: Hidden neuron pruning of multilayer perceptrons using a quantified sensitivity measure. Neurocomputing **69**(7–9), 825–837 (2006)
- 8. Lauret, P., Fock, E., Mara, T.A.: A node pruning algorithm based on a Fourier amplitude sensitivity test method. IEEE Trans. Neural Networks **17**(2), 273–293 (2006)
- 9. Hassibi, B., Stork, D.G., Wolff, G.J.: Optimal brain surgeon and general network pruning. In: IEEE International Conference on Neural Networks, pp. 293–299 (1993)
- 10. Engelbrecht, A.P.: A new pruning heuristic based on variance analysis of sensitivity information. IEEE Trans. Neural Networks **12**(6), 1386–1399 (2001)
- 11. Griewank, A., Walther, A.: Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, Philadelphia (2008)
- 12. Naumann, U.: The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation. SIAM, Philadelphia (2012)
- 13. Hascoet, L., Pascual, V.: The Tapenade automatic differentiation tool: principles, model, and specification. ACM Trans. Math. Softw. (TOMS) **39**(3), 1–43 (2013)
- 14. Corliss, G., Faure, C., Griewank, A., Hascoet, L., Naumann, U.: Automatic Differentiation of Algorithms. Springer, New York (2013)
- 15. Lotz, J., Leppkes, K., Naumann, U.: dco/c++ - derivative code by overloading in C++. https://www.stce.rwth-aachen.de/research/software/dco/cpp
- 16. Lotz, J., Naumann, U., Ungermann, J.: Hierarchical algorithmic differentiation a case study. In: Forth, S., Hovland, P., Phipps, E., Utke, J., Walther, A. (eds.) Recent Advances in Algorithmic Differentiation. LNCSE, vol. 87, pp. 187–196. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30023-3_17
- 17. Towara, M., Naumann, U.: A discrete adjoint model for OpenFOAM. Procedia Comput. Sci. **18**, 429–438 (2013)
- 18. Lotz, J., Schwalbach, M., Naumann, U.: A case study in adjoint sensitivity analysis of parameter calibration. Procedia Comput. Sci. **80**, 201–211 (2016)
- 19. Schichl, H., Neumaier, A.: Interval analysis on directed acyclic graphs for global optimization. J. Global Optim. **33**(4), 541–562 (2005). https://doi.org/10.1007/s10898-005-0937-x
- 20. Deussen, J., Riehme, J., Naumann, U.: Interval-adjoint significance analysis: a case study (2016). https://wapco.e-ce.uth.gr/2016/papers/SESSION2/wapco2016_2_4.pdf
- 21. Kelley, H.J.: Gradient theory of optimal flight paths. ARS J. **30**(10), 947–954 (1960)
- 22. Rojas, R.: The backpropagation algorithm. In: Neural Networks, pp. 149–182. Springer, Heidelberg (1996). https://doi.org/10.1007/978-3-642-61068-4_7
- 23. Moore, R.E.: Methods and Applications of Interval Analysis. Society for Industrial and Applied Mathematics, Philadelphia (1979)
- 24. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
- 25. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
- 26. LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010). http://yann.lecun.com/exdb/mnist/
- 27. Larochelle, H., Erhan, D., Courville, A., Bergstra, J., Bengio, Y.: An empirical evaluation of deep architectures on problems with many factors of variation. In: ACM Proceedings of the 24th International Conference on Machine Learning, pp. 473–480 (2007)
- 28. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images, vol. 1, no. 4, p. 7. Technical report, University of Toronto (2009)
- 29. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747 (2017)
- 30. Loshchilov, I., Hutter, F.: Fixing weight decay regularization in Adam. arXiv:1711.05101 (2017)
- 31. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv:1412.6980 (2014)
- 32. Hamner, B.: Popular datasets over time. https://www.kaggle.com/benhamner/populardatasets-over-time/code
- 33. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4780–4789 (2019)
- 34. Miikkulainen, R., et al.: Evolving deep neural networks. In: Artificial Intelligence in the Age of Neural Networks and Brain Computing, pp. 293–312 (2019)
- 35. Su, J., Vargas, D.V., Sakurai, K.: One pixel attack for fooling deep neural networks. IEEE Trans. Evol. Comput. **23**(5), 828–841 (2019)