
1 Introduction

One of the core problems in computer vision is image classification. Solving an image classification problem means correctly recognizing an image as belonging to a class, which translates into the correct identification of its key features. Image classification has many direct applications, including tumor classification and detection [1], biometric identification [15, 20, 23], object classification [9] and even emotion recognition [7]. This problem is typically complex to solve, and a number of algorithms have been designed to tackle it [8, 17, 24]. However, the top-performing models are nowadays neural networks. In particular, the so-called convolutional neural networks (CNNs) are able to take images as input, process them in order to extract the key features for the particular classification problem, and perform the classification itself. Applying simple optimization heuristics to minimize the loss function, like SGD [4, 27], slightly more complex optimizers like Nesterov [19] or Adam [14], or sophisticated local entropy minimizers [2, 6], it is nowadays possible to train extremely complex systems (deep networks) on huge datasets. Theoretically speaking, this is the “miracle” of deep learning, as the dimensionality of the problem is huge (indeed, these problems are typically over-parametrized, and the dimensionality can be efficiently reduced [25]). Furthermore, minimizing non-convex objective functions is typically expected to leave the trained architecture stuck in local minima. However, the empirical evidence shows that something else is happening under the hood: understanding it, in order to provide some guarantees for all the possible applications of image classification, is critical.

In this work, we propose a heuristic approach which should help us understand some basic properties of the solutions found by neural network models. Here, we aim to find a path joining two (or, in general, more) different solutions to the same classification problem. Early attempts to explore possible joining paths were based on random walk techniques, but the complexity of the task, due to the typically high dimensionality of the problem, made them extremely inefficient [13].

A recent work [12] suggests that solutions to the same problem are typically divided by a loss barrier, while a later work by Draxler et al. [10] shows the existence of low-loss paths joining solutions of similar performance. That work, however, focuses on the loss function, and a low loss is necessary but not sufficient to guarantee the performance on the training/test set. Our heuristic puts a hard constraint on performance: we never accept a performance (evaluated as the number of samples correctly classified by the neural network model) below a fixed threshold. In the case we ask our model to correctly classify the whole training set, we say that we lie in the solution region S of the training problem, also known as the version space. This will be our focus throughout this work.

In the last few years, thanks to the ever-increasing computational capability of computers, bigger and bigger neural networks have been proposed in order to solve increasingly complex problems. However, explaining why they succeed in solving complex classification tasks is nowadays a hot research topic [11, 21, 22]. In particular, it is still an object of study why simple optimizers like SGD, applied to typically non-convex problems, are sufficient to successfully train deep models [5, 16, 18]. The aim of this work is to move a step in the direction of explaining such a phenomenon, analyzing some typical solutions to learning problems and inspecting some of their properties. In this way, we aim to give some hints on which type of solutions SGD finds, and on whether there is some room for improvement.

The rest of this paper is organized as follows. In Sect. 2 we set up the problem environment and our aim, and we illustrate and justify the algorithm. Next, in Sect. 3 we test our algorithm on MNIST and on training sets containing uncorrelated, randomly-generated patterns. The experiments show that our proposed method is always able to find joining paths in S between any pair of solutions found for the same problem. Furthermore, hints on some properties of S are deduced by studying the joining path. Finally, Sect. 4 draws the conclusions and suggests further directions for future research.

2 The Proposed Algorithm

2.1 Preliminaries

In our setting, we have a training set \(\varXi _{tr}\) made of M pairs \((\xi _i, \sigma _i)\), in which we identify the inputs \(\xi _i\) and their associated desired outputs \(\sigma _i\). For the purpose of our work, we ask for some configuration W of the neural network such that the entire \(\varXi _{tr}\) problem is correctly solved. If such a condition is met, we say that the configuration W is a solution for the learning problem \(\varXi _{tr}\). In other words, a weight configuration \(W_k\) is a solution when

$$\begin{aligned} y_i\left| \xi _i, W_k\right. = \sigma _i\ \forall i \in \varXi _{tr} \end{aligned}$$
(1)

If we define S as the subset of all the configurations W which solve the whole training problem \(\varXi _{tr}\), we can say that \(W_k \in S\). Let us now imagine that two solutions to the same problem \(\varXi _{tr}\), \(W_a\) and \(W_b\), are provided. We aim to find a path \(\varOmega _{ab} \subset S\) which joins \(W_a\) to \(W_b\). At this point, we might face two different scenarios:

  1. \(\varOmega _{ab}\) is simply a straight line. According to the work by Goodfellow et al., we could draw a straight line between \(W_a\) and \(W_b\) which might be parametrized, for example, as

    $$\begin{aligned} l_{ab}(t) = (W_b - W_a)t + W_a \end{aligned}$$
    (2)

    with \(t\in [0, 1]\). In this scenario, the straight line itself is sufficient to join the two different solutions. However, as shown by the same work of Goodfellow et al., this is not the typical case [12] (a minimal numerical check of this condition is sketched after Fig. 1).

  2. \(\varOmega _{ab}\) is a “non-trivial” path, as \(l_{ab} \not \subset S\), and it is not a priori guaranteed to exist. This is the typical scenario, and the setting in which we are going to work. The work by Draxler et al. [10] shows that there exists a path \(\varGamma _{ab}\) having low loss value; however, in general, \(\varGamma _{ab}\not \subset S\). Our heuristic not only works in the case \(\varOmega _{ab} \ne l_{ab}\), but it also guarantees \(\varOmega _{ab} \subset S\) (Fig. 1).

Fig. 1. Example of the “non-trivial” scenario: while \(l_{ab}\) leaves the solution region S, \(\varOmega _{ab}\) remains inside it.
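
As a minimal illustration of how the “trivial” scenario can be tested numerically, the sketch below interpolates two trained PyTorch models along \(l_{ab}\) (Eq. 2) and checks whether every intermediate configuration still classifies the whole training set correctly. The names model_a, model_b and train_loader are illustrative placeholders, and the check assumes a multi-class argmax classifier.

```python
# Sketch: does the straight line l_ab(t) of Eq. 2 stay inside S?
import copy
import torch

def train_accuracy(model, loader, device="cpu"):
    """Fraction of training samples correctly classified (argmax outputs)."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            pred = model(x).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
    return correct / total

def line_stays_in_S(model_a, model_b, loader, steps=20, device="cpu"):
    wa, wb = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a).to(device)
    for t in torch.linspace(0.0, 1.0, steps):
        # Eq. 2, applied per parameter tensor (non-float buffers left untouched)
        interp = {k: ((wb[k] - wa[k]) * t + wa[k]) if wa[k].is_floating_point()
                  else wa[k] for k in wa}
        probe.load_state_dict(interp)
        if train_accuracy(probe, loader, device) < 1.0:
            return False    # l_ab leaves the solution region S
    return True
```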

2.2 Finding the Path

Our heuristic generates the path \(\varOmega _{ab}\) in a “Markov chain” fashion: we use a “survey” network \(W_x\), whose task is to modify its configuration (i.e. the value of its parameters) in order to move from \(W_a\)’s configuration to \(W_b\)’s, never leaving S. Hence, at time \(t=0\) we initialize \(W_x = W_a\), and we ask \(W_x\) to explore S such that, at some time \(t_f\), \(W_x = W_b\). The exploration algorithm is designed according to three very simple, basic principles:

  • We will never leave S.

  • As we start from \(W_a\), we want to arrive at \(W_b\) using a survey network \(W_x\), which draws \(\varOmega _{ab}\) in \(t_f\) steps.

  • At any time \(t'\), \(W_x\) only has knowledge of the training set and of the direction and distance towards the target \(W_b\).

In order to reach \(W_b\), we need to drive \(W_x\) towards it at every time t. To this end, we use an elastic force:

$$\begin{aligned} \varDelta W_x^{t} = \gamma (W_b - W_x^{t-1}) \end{aligned}$$
(3)

where \(\gamma \) is an elastic constant, whose value is typically \(\gamma \ll 1\).

If we just apply Eq. 3, in the non-trivial scenario \(W_x\) will leave the solution region, as we will have \(\varOmega _{ab} \equiv l_{ab}\). Hence, what we need here is to change the trajectory of \(\varOmega _{ab}\) in a “smart” way: it would be nice to stay away from the frontier of S. A piece of local information we have, which comes in handy in this context, is the gradient on the training set. Performing a GD step naturally drives \(W_x\) down the loss function and, supposedly, away from the frontier of S.

Along with the elastic coupling and the GD step, we impose a norm constraint on \(W_x\), acting as a regularizer, to be applied layer by layer, which constrains \(W_x\)’s norm to:

$$\begin{aligned} n\left( W_x^l\right) = \Vert W_b^l \Vert _F - \frac{\Vert W_b^l \Vert _F - \Vert W_a^l \Vert _F}{\Vert W_b^l - W_a^l \Vert _F} \Vert W_b^l - W_x^l \Vert _F \end{aligned}$$
(4)

where \(W_x^l\) indicates the l-th layer of \(W_x\) and \(\Vert \cdot \Vert _F\) is the Frobenius norm. Essentially, we are imposing a linear constraint on the norm of \(W_x\), which is a function of the distance from \(W_b\). Finally, as we have the hard constraint of remaining in S, we need to impose small steps for \(W_x\)

$$\begin{aligned} W_x^t = W_x^{t-1} + \delta W_x^t \end{aligned}$$
(5)

where, typically,

$$\begin{aligned} \delta W_x^t \ll \nabla W_x^{t-1} \end{aligned}$$
(6)

In this way, unless we find a local minimum very close to \(W_b\) and exactly on the path followed by \(W_x\) (extremely unlikely, as empirically observed, and an issue which can anyway be easily tackled with a proper tuning of \(\gamma \)), we avoid getting stuck in any local minimum.

To sum up, in order to generate \(\varOmega _{ab}\), after having initialized \(W_x\) to \(W_a\), we iteratively perform the following steps:

  1. Apply an elastic coupling in the direction of \(W_b\) (Eq. 3), with the hard constraint of never leaving S (the constraint is hard in the sense that we simulate “hitting a wall”, i.e. we discard every step which would put \(W_x\) outside S)

  2. Perform \(N_{epochs}\) of gradient descent (GD) steps evaluated on \(\varXi _{tr}\)

  3. Properly normalize \(W_x\) (Eq. 4)

The general algorithm is summarized in Algorithm 1.

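As a complement, the sketch below illustrates a single iteration of the procedure in PyTorch-like code: the elastic step of Eq. 3, the GD epochs, the layer-wise norm rescaling of Eq. 4 and the rejection of moves that leave S. It is a simplified illustration rather than a verbatim transcription of Algorithm 1; in particular, the norm constraint is applied per parameter tensor, the correctness check assumes an argmax classifier, and the names model_x, state_a, state_b, train_loader and loss_fn are placeholders (all assumed to live on the same device). In practice, the iteration is repeated until the distance between \(W_x\) and \(W_b\) falls below a tolerance.

```python
# Sketch of one iteration of the path-finding heuristic.
import copy
import torch

def solves_training_set(model, loader, device="cpu"):
    """Hard constraint of Eq. 1: every training sample correctly classified."""
    model.eval()
    with torch.no_grad():
        return all((model(x.to(device)).argmax(dim=1) == y.to(device)).all()
                   for x, y in loader)

def rescale_layerwise(state_x, state_a, state_b):
    """Impose the target Frobenius norm of Eq. 4 (here per parameter tensor)."""
    for k in state_x:
        if not state_x[k].is_floating_point():
            continue
        na, nb = torch.norm(state_a[k]), torch.norm(state_b[k])
        d_ab = torch.norm(state_b[k] - state_a[k])
        if d_ab == 0:
            continue
        d_xb = torch.norm(state_b[k] - state_x[k])
        target = nb - (nb - na) / d_ab * d_xb
        state_x[k] = state_x[k] * target / torch.norm(state_x[k])
    return state_x

def path_step(model_x, state_a, state_b, train_loader, loss_fn,
              gamma=1e-3, n_epochs=5, lr=0.1, device="cpu"):
    prev_state = copy.deepcopy(model_x.state_dict())
    # 1) elastic step towards W_b (Eq. 3), discarded if it leaves S
    with torch.no_grad():
        for name, p in model_x.named_parameters():
            p.add_(gamma * (state_b[name] - p))
    if not solves_training_set(model_x, train_loader, device):
        model_x.load_state_dict(prev_state)    # "hitting a wall": discard the move
    # 2) N_epochs of gradient-descent steps on the training set
    model_x.train()
    opt = torch.optim.SGD(model_x.parameters(), lr=lr)
    for _ in range(n_epochs):
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model_x(x.to(device)), y.to(device)).backward()
            opt.step()
    # 3) layer-wise norm constraint (Eq. 4)
    # (a fuller implementation would also verify S-membership after steps 2 and 3)
    model_x.load_state_dict(
        rescale_layerwise(model_x.state_dict(), state_a, state_b))
    return model_x
```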

2.3 Properties of the Path

Once we have obtained \(\varOmega _{ab}\), we can perform an empirical investigation on it. There are several interesting observations we can make:

  • Is there any property related to the shape of S? As typical problems are extremely high-dimensional, it is very difficult to deduce global properties of S. However, we might get some hints on how S is shaped from two indicators:

    • If we are always able to find \(\varOmega _{ab}\subset S\), this suggests that all the algorithmically-accessible solutions in S, collected in \(S_{algo}\subset S\), live in a connected subspace.

    • We can study the Hessian along \(\varOmega _{ab}\). Even though this is not a fully-informative observation for S, we can deduce some properties, like the shape of the loss in \(S_{algo}\).

  • Verify how the loss function varies along \(\varOmega _{ab}\): as our technique is strictly bound to S and not necessarily to the minimization of the loss function, it may happen that some solutions to \(\varXi _{tr}\) have a high loss.

  • Check how the generalization error, defined as the error on the test set, varies along \(\varOmega _{ab}\).

All of these aspects will be empirically investigated in Sect. 3.

3 Experiments

The proposed algorithm was tested under two very different settings and architectures. In both cases, a path \(\varOmega _{ab}\subset S\) has always been found.

3.1 Tree Committee Machine on Random Patterns

In our first experiments, we use a simple kind of neural network, the so-called Tree Committee Machine (TCM). It is a binary classifier consisting of a one-hidden-layer neural network having N inputs and K neurons in the hidden layer. The connectivity of the hidden layer is tree-like: the k-th neuron of the hidden layer receives data from an exclusive subset of \(\frac{N}{K}\) inputs. In particular, for our setting, the output of the TCM is defined as

$$\begin{aligned} \hat{y}_i = \tanh \left[ \sum _{k=1}^{K} \text{ htanh } \left( \sum _{j=1}^{\frac{N}{K}}W_{kj}\cdot \xi _{i,(k-1)\frac{N}{K}+j} \right) \right] \end{aligned}$$
(7)

where htanh is the hard tanh, i.e. the identity function clipped to the interval \([-1, +1]\).

The training set \(\varXi _{tr}\) is randomly generated: the M input patterns \(\xi _i \in \{-1, +1\}^{N}\) and the desired outputs \(\sigma _i \in \{-1, +1\}\) are generated at random.

All the experiments shown here are performed on TCMs of size \(N=300\) and \(K=3\), and the training sets consist of \(M=620\) samples. The reference solutions to \(\varXi _{tr}\) have been trained using standard GD, minimizing the binary cross-entropy loss function with \(\eta =0.1\). The network has been initialized using a Gaussian initializer. In this setting, \(\gamma =0.001\) and \(N_{epochs} = 5\). The algorithm was tested on 10 different, randomly-generated datasets; for each of them, 3 different configurations \(W_i\in S\) were obtained, which we then attempted to connect. The neural network and the algorithm are implemented in Julia 0.5.2 [3].
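
Our implementation of this experiment is in Julia; purely as an illustrative reference, the following PyTorch-style sketch reproduces the forward pass of Eq. 7 and the random generation of \(\varXi _{tr}\). The Gaussian weight scaling is an arbitrary choice for the sketch, not the exact initializer used in the experiments.

```python
# Illustrative PyTorch sketch of the Tree Committee Machine of Eq. 7
# and of the random training set (N must be divisible by K).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TreeCommitteeMachine(nn.Module):
    def __init__(self, n_inputs=300, k_hidden=3):
        super().__init__()
        assert n_inputs % k_hidden == 0
        self.k = k_hidden
        self.block = n_inputs // k_hidden          # N/K inputs per hidden unit
        # one weight vector per hidden unit, acting on its exclusive input block
        self.W = nn.Parameter(torch.randn(k_hidden, self.block) / self.block**0.5)

    def forward(self, xi):                         # xi: (batch, N), entries in {-1,+1}
        blocks = xi.view(-1, self.k, self.block)   # split input into K disjoint blocks
        pre = (blocks * self.W).sum(dim=2)         # (batch, K) pre-activations
        return torch.tanh(F.hardtanh(pre).sum(dim=1))   # Eq. 7

# random, uncorrelated training set: M binary patterns and labels
N, K, M = 300, 3, 620
xi = torch.randint(0, 2, (M, N)).float() * 2 - 1   # patterns in {-1, +1}^N
sigma = torch.randint(0, 2, (M,)).float() * 2 - 1  # desired outputs in {-1, +1}
```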

Even though we are in the typical scenario, in which the error along \(l_{ab}\) is greater than 0, we are able to find a non-trivial path in S. For this network, it is also possible to compute the exact Hessian matrix. Surprisingly, the typically observed scenario is that, along any found \(\varOmega \subset S\), all the non-zero eigenvalues of the Hessian matrix are strictly positive, the number of non-zero eigenvalues is constant, and the reference solutions are local minima for the trace of the Hessian matrix. An example of this result is shown in Fig. 2. This result is obtained in a hard learning scenario, and it may suggest that, even though learning problems are typically non-convex, GD-based techniques work because the algorithmically accessible region \(S_{algo}\) is convex.

Fig. 2. Hessian eigenvalues along \(\varOmega _{ab}\) in the TCM for random patterns. In this case, just three eigenvalues are non-zero and all of them are positive. Along the path, the loss on the training set is proportional to the trace of the Hessian.

3.2 LeNet5 on MNIST

We also performed experiments on LeNet5 solutions trained on the MNIST dataset. In particular, we first show simulations on a reduced MNIST (training is performed on the first 100 images: we call it MNIST-100) and then on the full MNIST dataset. The software used for the following simulations is PyTorch 1.1 with CUDA 10.
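
The reduced MNIST-100 training set can be built, for instance, as a subset of the standard torchvision MNIST training split; the snippet below is one possible way to do it, with illustrative paths, normalization constants and batch size rather than the exact values used in our runs.

```python
# Possible construction of the reduced "MNIST-100" training set:
# the first 100 images of the standard MNIST training split.
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))])
mnist_train = datasets.MNIST("./data", train=True, download=True,
                             transform=transform)
mnist_100 = Subset(mnist_train, range(100))        # first 100 training images
train_loader = DataLoader(mnist_100, batch_size=100, shuffle=True)
```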

Fig. 3. Example of \(\varOmega _{ab}\) for LeNet5 with MNIST-100. The x axis is a normalized distance between \(W_a\) and \(W_b\).

For the MNIST-100 case, the networks have been trained using SGD with \(\eta =0.1\) and Xavier initialization. The joining path heuristic used \(\gamma =0.001\) and \(N_{epochs} = 5\). Despite the higher dimensionality and complexity of LeNet5, also in this case it has always been possible to find a path \(\varOmega _{ab}\) in S. A typical observed behavior is shown in Fig. 3. It is interesting to observe that, in general, moving along \(\varOmega \), the training and test losses are neither monotonic nor bi-tonic, but show a more complex behavior (an example is in Fig. 3(a)). Furthermore, the test set error shows a behavior similar to that of the test set loss, but not locally identical to it (Fig. 3(b)).

Fig. 4. Hessian eigenvalues along \(\varOmega _{ab}\) for LeNet5 trained on MNIST-100 (same experiment as Fig. 3). Here the top-20 Hessian eigenvalues are plotted.

Fig. 5. Example of \(\varOmega _{ab}\) for LeNet5 with the entire training set. The x axis is a normalized distance between \(W_a\) and \(W_b\).

It is also interesting to investigate the behavior of the eigenvalues of the Hessian along \(\varOmega _{ab}\) in this scenario. Figure 4 plots the top-20 eigenvalues. The Hessian eigenvalue computation has been performed using the code by Gholami [26]. Interestingly, even for a more complex architecture like LeNet5, the top eigenvalues are all positive along the entire \(\varOmega _{ab}\) path.
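
As an independent illustration of how such eigenvalues can be estimated (not the code of [26]), the sketch below runs a power iteration on Hessian-vector products computed via automatic differentiation; model, loss_fn, x and y are placeholders for the network and a batch of training data. The estimate converges to the eigenvalue of largest magnitude of the loss Hessian.

```python
# Power iteration on Hessian-vector products (Pearlmutter trick via autograd).
import torch

def top_hessian_eigenvalue(model, loss_fn, x, y, iters=50):
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # random initial direction, normalized over all parameter tensors
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
    v = [vi / norm for vi in v]
    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product: gradient of (g . v) w.r.t. the parameters
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((hvi * vi).sum() for hvi, vi in zip(hv, v)).item()  # Rayleigh quotient
        norm = torch.sqrt(sum((hvi ** 2).sum() for hvi in hv))
        v = [hvi / norm for hvi in hv]
    return eig
```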

Besides the simulations on MNIST-100, we also attempted to find a joining path between two solutions for the entire MNIST dataset, still using LeNet5. In this case, \(\gamma =0.1\) and \(N_{epochs} = 5\), while the initial configurations are trained using SGD with \(\eta =0.1\) and Xavier initialization. According to our findings, in this setting a zero-error joining path typically exists even for the whole MNIST problem (Fig. 5). Interestingly, the best generalization performance (at about 0.2 on the normalized distance scale) is found far from both the solutions found by SGD, and typically cannot be reached by vanilla SGD, as it corresponds to a higher training loss value (even though it lies in the version space).

4 Conclusion

In this work, a heuristic approach to find a path \(\varOmega _{ab}\) joining two solutions \(W_a\) and \(W_b\) to the same training problem \(\varXi _{tr}\) has been proposed. The main property of \(\varOmega _{ab}\) is that it lies entirely in the solution region S of the configurations W which solve the training problem. In general, such an approach is not guaranteed to produce an \(\varOmega \): if S is not connected and \(W_a\) and \(W_b\) belong to two different sub-spaces of S, by construction, \(\varOmega _{ab}\) cannot exist. According to our empirical observations, both with randomly-generated, uncorrelated, synthetic training sets and with MNIST, the subspace \(S_{algo} \subseteq S\) accessed by GD-based techniques seems to be connected. Furthermore, we have some hints indicating that \(S_{algo}\) might be convex, as well as further evidence that SGD alone is not sufficient to guarantee the best generalization, not even for nowadays-simple classification problems like MNIST.

The proposed technique potentially allows us to extend the investigation of S also to non-typical algorithmic solutions to the learning problem, along the drawn \(\varOmega \) paths. These findings open new research directions in the field of explainable neural networks. Future work includes the study of how the generalization error varies along \(\varOmega \) on more complex classification tasks, and the design of an algorithm to improve generalization along the path.