1 Formulation of the Problem

Need for machine learning. In many practical situations, we know that the quantities \(y_1,\ldots ,y_L\) depend on the quantities \(x_1,\ldots ,x_n\), but we do not know the exact formula for this dependence. To get this formula, we measure the values of all these quantities in different situations \(m=1,\ldots ,M\), and then use the resulting measurements \(x^{(m)}_i\) and \(y^{(m)}_\ell \) to reconstruct the dependence. Algorithms that “learn” the dependence from the measurement results are known as machine learning algorithms.

Neural networks (NN): main idea and successes. One of the most widely used machine learning techniques is the technique of neural networks (NN), which is based on a (simplified) simulation of how actual neurons work in the human brain (a brief technical description of this technique is given in Sect. 2). This technique has many useful applications; see, e.g., [1, 2].

At present (2020), multi-layer (“deep”) neural networks are, empirically, the most efficient of the known machine learning techniques.

Neural networks: limitations. One of the main limitations of neural networks is that their learning is very slow: they need many thousands of iterations just to learn a simple dependence.

This slowness is easy to explain: current neural networks always start “from scratch”, from zero knowledge. In terms of simulating the human brain, they do not simulate how we learn the corresponding dependence; they simulate how a newborn child eventually learns to recognize it. Of course, this inability to take any prior knowledge into account drastically slows down the learning process.

What is prior knowledge. Prior knowledge means that we know some relations (“constraints”) between the desired values \(y_1,\ldots ,y_L\) and the observed values \(x_1,\ldots ,x_n\), i.e., we know several relations of the type

$$ f_c(x_1,\ldots ,x_n,y_1,\ldots ,y_L)=0,\ \ 1\le c\le C. $$
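For example (a hypothetical illustration, not an example from the text): if we know that \(y_1\) and \(y_2\) are the sine and cosine of the same unknown angle, and that \(y_3\) equals the product \(x_1\cdot x_2\), this knowledge gives us two such constraints. The following is a minimal sketch of how such constraints could be encoded; all names are illustrative:

```python
# A hypothetical pair of constraints f_c(x_1, ..., x_n, y_1, ..., y_L) = 0:
# suppose we know that y1 and y2 are the sine and cosine of the same unknown
# angle, and that y3 = x1 * x2.
def f_1(x, y):
    # sin^2 + cos^2 = 1, i.e., y1^2 + y2^2 - 1 = 0
    return y[0] ** 2 + y[1] ** 2 - 1.0

def f_2(x, y):
    # y3 - x1 * x2 = 0
    return y[2] - x[0] * x[1]

constraints = [f_1, f_2]  # here C = 2
```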

Prior knowledge helps humans learn faster. Prior knowledge helps us learn. Yes, it takes some time to acquire this prior knowledge, but this happens before we have samples of \(x_i\) and \(y_\ell \). As a result, the time from gathering the samples to producing the desired dependence decreases.

This is not simply a matter of accounting: the same prior knowledge can be used (and usually is used) in learning several different dependencies. For example, our knowledge of sines, logarithms, and calculus helps in finding the proper dependence in many different situations. So, when we learn the prior knowledge first, we decrease the overall time needed to learn all these dependencies.

How to speed up artificial neural networks: a natural idea. In view of the above explanation, a natural idea is to enable neural networks to take prior knowledge into account. In other words, instead of learning all the data “from scratch”, we should first learn the constraints. Then, when it is time to use the data, we should be able to use these constraints to “guide” the neural network in the right direction.

What we do in this paper. In this paper, we show how to implement this idea and thus, how to (hopefully) achieve the corresponding speed-up.

To describe this idea, we first, in Sect. 2, recall how the usual NN works. Then, in Sect. 3, we show how we can perform a preliminary training of a NN, so that it can learn to satisfy the given constraints. Finally, in Sect. 4, we show how to train the resulting pre-trained NN in such a way that the constraints remain satisfied.

2 Neural Networks: A Brief Reminder

Signals in a biological neural network. In a biological neural network, a signal is represented by a sequence of spikes. All these spikes are largely the same; what differs is how frequently they come.

Several types of sensor cells generate such sequences: e.g., there are cells that translate the optical signal into spikes, and there are cells that translate the acoustic signal into spikes. For all such cells, the more intense the original physical signal, the more spikes per unit time the cell generates. Thus, the frequency of the spikes can serve as a measure of the strength of the original signal.

From this viewpoint, at each point in a biological neural network, at each moment of time, the signal can be described by a single number: namely, by the frequency of the corresponding spikes.

What is a biological neuron: a brief description. A biological neuron has several inputs and one output. Usually, spikes from different inputs simply get together, possibly after some filtering. Filtering means that we suppress a certain proportion of spikes: if we start with an input signal \(x_i\), then, after such filtering, we get a decreased signal \(w_i\cdot x_i\). Once all the input signals are combined, we have the resulting signal \(\sum \limits _{i=1}^n w_i\cdot x_i\).

A biological neuron usually has some excitation level \(w_0\): if the overall input signal is below \(w_0\), there is practically no output. The intensity of the output signal thus depends on the difference \(d{\mathop {=}\limits ^\mathrm{def}}\sum \limits _{i=1}^n w_i\cdot x_i-w_0\). Some neurons are linear: their output is proportional to this difference. Other neurons are non-linear: their output is equal to \(s_0(d)\) for some non-linear function \(s_0(z)\). Empirically, it was found that the corresponding non-linear transformation takes the form \(s_0(z)=1/(1+\exp (-z))\).
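In code, a single non-linear neuron of this type might look as follows (a minimal sketch, assuming the sigmoid activation \(s_0\) described above; the function names are illustrative):

```python
import numpy as np

def s0(z):
    # sigmoid activation s0(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, w0):
    # output s0(d), where d = sum_i w_i * x_i - w0
    return s0(np.dot(w, x) - w0)
```

For any inputs, the output of `neuron` lies strictly between 0 and 1, reflecting the saturation of the biological neuron's firing rate.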

Comment. It should be mentioned that this is a simplified description of a biological neuron: the actual neuron is a complex dynamical system, in the sense that its output at a given moment of time depends not only on the current inputs, but also on the previous input values.

Artificial neural networks and how they learn. If we need to predict the values of several outputs \(y_1\), ..., \(y_\ell \), ..., \(y_L\), then for each output \(y_\ell \), we train a separate neural network.

In an artificial neural network, the input signals \(x_1,\ldots ,x_n\) first go to the neurons of the first layer, then the results go to the neurons of the second layer, etc.

In the simplest (and most widely used) arrangement, the second layer has linear neurons. In this arrangement, the neurons from the first layer produce the signals \(y_{\ell , k}=s_0\left( \sum \limits _{i=1}^n w_{\ell , ki}\cdot x_i-w_{\ell , k0}\right) \), \(1\le k\le K_\ell \), which are then combined into an output \(y_\ell =\sum \limits _{k=1}^{K_\ell } W_{\ell , k}\cdot y_{\ell , k}-W_{\ell , 0}\). This is called forward propagation. (In this paper, we will only describe formulas for this arrangement, since formulas for the multi-layer neural networks can be obtained by using the same idea.)
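One pass of forward propagation for a single output \(y_\ell \) might be implemented as follows (a NumPy sketch; the variable names mirror the notation above and are our own):

```python
import numpy as np

def s0(z):
    # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, w0, W, W0):
    """Forward propagation for one output y_ell:
    w  -- K x n matrix of first-layer weights w_{ell,ki},
    w0 -- K first-layer thresholds w_{ell,k0},
    W  -- K second-layer weights W_{ell,k},
    W0 -- second-layer threshold W_{ell,0}."""
    y_k = s0(w @ x - w0)   # non-linear first layer: y_{ell,k}
    y = W @ y_k - W0       # linear second layer: y_ell
    return y, y_k
```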

How a NN learns: derivation of the formulas. Once we have an observation \((x^{(m)}_1,\ldots ,x^{(m)}_n,y_\ell ^{(m)})\), we first input the values \(x^{(m)}_1,\ldots ,x^{(m)}_n\) into the current NN; the network generates some output \(y_{\ell ,NN}\). In general, this output is different from the observed output \(y_\ell ^{(m)}\). We therefore want to modify the weights \(W_{\ell , k}\) and \(w_{\ell , ki}\) so as to minimize the squared difference \(J{\mathop {=}\limits ^\mathrm{def}}(\varDelta y_\ell )^2\), where \(\varDelta y_\ell {\mathop {=}\limits ^\mathrm{def}}y_{\ell ,NN}-y_\ell ^{(m)}\). This minimization is done by using gradient descent, where each of the unknown values is updated as \(W_{\ell , k}\rightarrow W_{\ell , k}-\lambda \cdot \displaystyle \frac{\partial J}{\partial W_{\ell , k}}\) and \(w_{\ell , ki}\rightarrow w_{\ell , ki}-\lambda \cdot \displaystyle \frac{\partial J}{\partial w_{\ell , ki}}\). The resulting algorithm for updating the weights is known as backpropagation. This algorithm is based on the following idea.

First, one can easily check that \(\displaystyle \frac{\partial J}{\partial W_{\ell , 0}}=-2\varDelta y_\ell \), so \(\varDelta W_{\ell , 0}=-\lambda \cdot \displaystyle \frac{\partial J}{\partial W_{\ell , 0}}=\alpha \cdot \varDelta y_\ell \), where \(\alpha {\mathop {=}\limits ^\mathrm{def}}2\lambda \). Similarly, \(\displaystyle \frac{\partial J}{\partial W_{\ell , k}}=2\varDelta y_\ell \cdot y_{\ell , k}\), so \(\varDelta W_{\ell , k}=-\lambda \cdot \displaystyle \frac{\partial J}{\partial W_{\ell , k}}=-2\lambda \cdot \varDelta y_\ell \cdot y_{\ell , k}\), i.e., \(\varDelta W_{\ell , k}=-\varDelta W_{\ell , 0}\cdot y_{\ell , k}\).

The only dependence of \(y_\ell \) on \(w_{\ell , ki}\) is via the dependence of \(y_{\ell , k}\) on \(w_{\ell , ki}\). So, for \(w_{\ell , k0}\), we can use the chain rule and get \(\displaystyle \frac{\partial J}{\partial w_{\ell , k0}}= \displaystyle \frac{\partial J}{\partial y_{\ell , k}}\cdot \displaystyle \frac{\partial y_{\ell , k}}{\partial w_{\ell , k0}}\), hence:

$$ \displaystyle \frac{\partial J}{\partial w_{\ell , k0}}=2\varDelta y_\ell \cdot W_{\ell , k}\cdot s'_0\left( \sum \limits _{i=1}^n w_{\ell , ki}\cdot x_i-w_{\ell , k0}\right) \cdot (-1). $$

For \(s_0(z)=1/(1+\exp (-z))\), we have \(s'_0(z)=\exp (-z)/(1+\exp (-z))^2\), i.e.,

$$ s'_0(z)=\displaystyle \frac{\exp (-z)}{1+\exp (-z)}\cdot \displaystyle \frac{1}{1+\exp (-z)}=s_0(z)\cdot (1-s_0(z)). $$
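This identity is easy to verify numerically; here is a small self-check (a sketch, with an illustrative finite-difference step):

```python
import numpy as np

def s0(z):
    return 1.0 / (1.0 + np.exp(-z))

# check s0'(z) = s0(z) * (1 - s0(z)) by central differences
for z in [-2.0, 0.0, 1.5]:
    h = 1e-6
    numeric = (s0(z + h) - s0(z - h)) / (2.0 * h)
    analytic = s0(z) * (1.0 - s0(z))
    assert abs(numeric - analytic) < 1e-8
```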

Thus, in the above formula, where \(s_0(z)=y_{\ell , k}\), we get \(s'_0(z)=y_{\ell , k}\cdot (1-y_{\ell , k})\), \(\displaystyle \frac{\partial J}{\partial w_{\ell , k0}}=-2\varDelta y_\ell \cdot W_{\ell , k}\cdot y_{\ell , k}\cdot (1-y_{\ell , k})\), and

$$ \varDelta w_{\ell , k0}=-\lambda \cdot \displaystyle \frac{\partial J}{\partial w_{\ell , k0}}=\lambda \cdot 2\varDelta y_\ell \cdot W_{\ell , k}\cdot y_{\ell , k}\cdot (1-y_{\ell , k}). $$

So, we have \(\varDelta w_{\ell , k0}=-\varDelta W_{\ell , k}\cdot W_{\ell , k}\cdot (1-y_{\ell , k})\). For \(w_{\ell , ki}\), we have

$$ \displaystyle \frac{\partial J}{\partial w_{\ell , ki}}=2\varDelta y_\ell \cdot W_{\ell , k}\cdot y_{\ell , k}\cdot (1-y_{\ell , k})\cdot x_i=-\displaystyle \frac{\partial J}{\partial w_{\ell , k0}}\cdot x_i, $$

hence \(\varDelta w_{\ell , ki}=-x_i\cdot \varDelta w_{\ell , k0}\). Thus, we arrive at the following algorithm:

Resulting algorithm. We pick some value \(\alpha \), and cycle through observations \((x_1,\ldots ,x_n)\) with the desired outputs \(y_\ell \). For each observation, we first apply the forward propagation to compute the network’s prediction \(y_{\ell ,NN}\), then we compute \(\varDelta y_\ell =y_{\ell ,NN}-y_\ell \), \(\varDelta W_{\ell , 0}=\alpha \cdot \varDelta y_\ell \), \(\varDelta W_{\ell , k}=-\varDelta W_{\ell , 0}\cdot y_{\ell , k}\), \(\varDelta w_{\ell , k0}=-\varDelta W_{\ell , k}\cdot W_{\ell , k}\cdot (1-y_{\ell , k})\), and \(\varDelta w_{\ell , ki}=-\varDelta w_{\ell , k0}\cdot x_i\), and update each weight w to \(w_\mathrm{new}=w+\varDelta w\). We repeat this procedure until the process converges.
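Put together, one update step of this algorithm might be implemented as follows (a minimal NumPy sketch of the formulas above; the function name and the default value of \(\alpha \) are illustrative):

```python
import numpy as np

def s0(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y_target, w, w0, W, W0, alpha=0.1):
    """One backpropagation update for one observation (x, y_target),
    following the update formulas derived above."""
    y_k = s0(w @ x - w0)          # forward propagation
    y_nn = W @ y_k - W0
    dy = y_nn - y_target          # Delta y_ell
    dW0 = alpha * dy              # Delta W_{ell,0}
    dW = -dW0 * y_k               # Delta W_{ell,k}
    dw0 = -dW * W * (1.0 - y_k)   # Delta w_{ell,k0}
    dw = -np.outer(dw0, x)        # Delta w_{ell,ki} = -Delta w_{ell,k0} * x_i
    return w + dw, w0 + dw0, W + dW, W0 + dW0
```

Cycling this step over all observations until the weights stop changing reproduces the procedure described above.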

3 How to Pre-Train a NN to Satisfy Given Constraints

To pre-train the network, we can use any tuples \((x^{(m)}_1,\ldots ,x^{(m)}_n,y_1^{(m)},\ldots ,y_L^{(m)})\) that satisfy all the known constraints.

To satisfy the constraints \(f_c(x_1,\ldots ,x_n,y_1,\ldots ,y_L)=0\), \(1\le c\le C\), means to minimize the distance from the vector of values \((f_1,\ldots ,f_C)\) to the ideal point \((0,\ldots ,0)\), i.e., equivalently, to minimize the sum \(F{\mathop {=}\limits ^\mathrm{def}}\sum \limits _{c=1}^C (f_c(x_1,\ldots ,x_n,y_1,\ldots ,y_L))^2.\) To minimize this sum, we can use the same gradient descent idea. From the mathematical viewpoint, the only difference from the usual backpropagation is the first step: here, since \(\partial y_\ell /\partial W_{\ell , 0}=-1\),

$$ \frac{\partial F}{\partial W_{\ell , 0}}=-2\cdot \sum _{c=1}^C f_c\cdot \frac{\partial f_c}{\partial y_\ell },\ \ \ \text{ hence }\ \ \ \varDelta W_{\ell , 0}=\alpha \cdot \sum _{c=1}^C f_c\cdot \frac{\partial f_c}{\partial y_\ell }. $$

Once we have computed \(\varDelta W_{\ell , 0}\), all the other changes \(\varDelta W_{\ell , k}\) and \(\varDelta w_{\ell , ki}\) are computed based on the same formulas as above.

The consequence of this modification is that instead of L independent neural networks, one for each of the L outputs \(y_\ell \), we now have L inter-dependent ones. The dependence comes from the fact that, to start a new update cycle for each \(\ell \), we need to know the values \(y_1,\ldots ,y_L\) corresponding to all the outputs.
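A sketch of one such coupled pre-training step is given below. It assumes that, for each constraint \(f_c\), we can also evaluate the partial derivatives \(\partial f_c/\partial y_\ell \), here supplied as explicit functions; all names are illustrative:

```python
import numpy as np

def s0(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_step(x, nets, fs, dfs, alpha=0.1):
    """One constraint-only update.  nets[ell] = (w, w0, W, W0) are the
    parameters for output y_ell; fs[c](x, y) computes f_c;
    dfs[c][ell](x, y) computes df_c / dy_ell."""
    # forward propagation for all L outputs: the L networks are coupled,
    # since every constraint may involve all of y_1, ..., y_L
    y_ks = [s0(w @ x - w0) for (w, w0, W, W0) in nets]
    y = np.array([W @ y_k - W0
                  for (w, w0, W, W0), y_k in zip(nets, y_ks)])
    new_nets = []
    for ell, ((w, w0, W, W0), y_k) in enumerate(zip(nets, y_ks)):
        # modified first step: Delta W_{ell,0} = alpha * sum_c f_c * df_c/dy_ell
        dW0 = alpha * sum(f(x, y) * df[ell](x, y) for f, df in zip(fs, dfs))
        # remaining steps: same formulas as in the usual backpropagation
        dW = -dW0 * y_k
        dw0 = -dW * W * (1.0 - y_k)
        dw = -np.outer(dw0, x)
        new_nets.append((w + dw, w0 + dw0, W + dW, W0 + dW0))
    return new_nets
```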

4 How to Retain Constraints When Training Neural Networks on Real Data

Once the network is pre-trained so that the constraints are all satisfied, we need to train it on the real data. In this real-data training, we need to make sure not only that the network fits all the given data points, but also that all C constraints remain satisfied. In other words, on each step, we need to make sure not only that \(\varDelta y_\ell \) is close to 0, but also that \(f_c(x_1,\ldots ,x_n,y_1,\ldots ,y_L)\) is close to 0 for all \(c\). So, similarly to the previous section, instead of minimizing \(J=(\varDelta y_\ell )^2\), we should minimize a combined objective function \(G{\mathop {=}\limits ^\mathrm{def}}J+N\cdot F\), where N is an appropriate positive constant and \(F=\sum \limits _{c=1}^C f_c^2\).

Similarly to pre-training, the only difference from the usual backpropagation algorithm is that we compute the values \(\varDelta W_{\ell , 0}\) differently:

$$\varDelta W_{\ell , 0}=\alpha \cdot \left( \varDelta y_\ell +N\cdot \sum \limits _{c=1}^C f_c\cdot \frac{\partial f_c}{\partial y_\ell }\right) .$$
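In code, the only change relative to the pre-training sketch of Sect. 3 is the first line of the update (again a sketch; the trade-off constant N and all names are illustrative):

```python
import numpy as np

def s0(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y_target, nets, fs, dfs, alpha=0.1, N=1.0):
    """One update on a real observation (x, y_target) that also keeps
    the C constraints approximately satisfied: minimizes G = J + N * F."""
    y_ks = [s0(w @ x - w0) for (w, w0, W, W0) in nets]
    y = np.array([W @ y_k - W0
                  for (w, w0, W, W0), y_k in zip(nets, y_ks)])
    new_nets = []
    for ell, ((w, w0, W, W0), y_k) in enumerate(zip(nets, y_ks)):
        penalty = sum(f(x, y) * df[ell](x, y) for f, df in zip(fs, dfs))
        dW0 = alpha * ((y[ell] - y_target[ell]) + N * penalty)  # modified step
        dW = -dW0 * y_k
        dw0 = -dW * W * (1.0 - y_k)
        dw = -np.outer(dw0, x)
        new_nets.append((w + dw, w0 + dw0, W + dW, W0 + dW0))
    return new_nets
```

For N = 0, this step reduces to the usual backpropagation of Sect. 2; the larger N, the more strongly the update favors keeping the constraints satisfied over fitting the current data point.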