Abstract
Approximating a function with a learning neural network (LNN) has long been considered by many authors and is known as “the universal approximation property”. The smaller the required precision, the more neurons in the hidden layer one should take to reach it. Another challenge for learning neural networks is understanding how to reduce the computational expense and how to select the training examples. Our contribution here is to consider an LNN as a discretization of the time-dependent transport of the identity function, Id, towards the function T to be approximated by the LNN. Using classical interpolation properties for \({\mathbb {P}}_1\) Lagrange functions, we are able to give space-time error estimates for the simple LNN we introduce.
Introduction
Artificial neural networks have become very popular in computer science in recent years, particularly for image processing and signal processing. These methods aim to approximate functions with layers of neurons (or units), connected by linear operations between units and nonlinear activations within units; see [8, 9, 13] and references therein. The input \(Y^k\) of a given layer k is linked to the output \(Y^{k+1}\) of this layer through a transformation which can be expressed by \(Y^{k+1} = Y^k + F^k(Y^k, \theta ^k)\), where \(\theta ^k\) is a parameter vector. This representation of an LNN is a one-step discretization of the flow \(Y^t\): \(\frac{d}{dt} Y^t = F(Y^t,\theta _t)\) for \(t_k<t<t_{k+1}\) with \(Y^t({t_k})=Y^k\). Several studies have considered the idea of viewing such networks as discretizations of ordinary differential equations (ODEs) or of controlled ODEs; see, for example, [7, 3, 5, 6], among others, or as time-delay differential equations [12]. In this study, we consider the proposed LNN as a discretization of the time-dependent transport of the Id function towards the function T to be approximated by the LNN, which reads: \(\frac{d}{dt} Y^t = \ln {(T)}\, Y^t\) for \(0<t<1\) with \(Y^t(0)=Id\).
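In the scalar case this transport is easy to visualize: each explicit Euler step multiplies the state by \(1+\varDelta t \ln T\), so after N layers the initial value is carried to \((1+\ln T/N)^N \rightarrow T\). A minimal sketch of this mechanism (illustrative only; the helper name is ours):

```python
import math

def transport(target, N):
    """N explicit Euler steps of y'(t) = ln(target) * y(t), y(0) = 1.
    The result (1 + ln(target)/N)**N tends to `target` as N grows."""
    y, rate, dt = 1.0, math.log(target), 1.0 / N
    for _ in range(N):
        y += dt * rate * y  # one "layer" of the network
    return y

# The O(1/N) discretization error shrinks as layers are added:
for N in (10, 100, 1000):
    print(N, abs(transport(2.0, N) - 2.0))
```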
Approximating a function (and its derivatives) with a neural network was given, for example, in a version of the universal approximation theorem in [11] (Theorem 3), and more recently in [2].
In [10], it is proved that ReLU deep neural networks have the ability to represent the basis functions of simplicial finite elements of order one, thus allowing the approximation of second-order PDE problems that can be formulated as the minimization of an energy functional.
In this paper, we aim to make progress towards a theoretical understanding of the approximation accuracy of a simple learning neural network (LNN) using the classical tools of numerical analysis. The starting point of an LNN is a dataset of input–output pairs \(\{ X_i, T(X_i) \}_{i=1}^p\), called sampling points, or learning examples, where T is an unknown operator whose values are only known at the sample points \(X_i\). As an example, \(X_i\) can be thought of as an image and \(T(X_i)\) as a transformed image.
Images are represented by vectors (the pixel columns of an image are stacked into a vector). Thus, the unknown operator T is handled componentwise, and therefore, in what follows, T denotes an unknown operator with values in \({\mathbb {R}}\). A feed-forward architecture is commonly used for a neural network (NN), which implements the operator T as a sequence of linear transformations, each followed by a componentwise application of a nonlinear function (the positive-part function \(x \mapsto x^+\)), called the activation function. The length N of the sequence is referred to as the number of layers of the network, and the input of a layer m, \(1 \le m \le N\), consists of the output of the preceding layer \(Y^{m-1} \in {\mathbb {R}}^p\), where p is the number of neurons of layer m; the output of this layer is \(Y^{m} =\left( W^mY^{m-1} \right) ^+\) with \(W^m\) the weight matrix in \({\mathbb {M}}_{p\times p}\). Such an NN is said to have p channels. The input layer of the LNN has to be specified (how to distribute the input X to every channel), and the last one consists of adding the contributions of all channels. For a given input X, the output of the LNN is denoted \(Y_X\), and the matrices \(W^m\) are determined by minimizing the quadratic error with respect to the learning examples \(\{ X_i, T(X_i) \}_{i=1}^p\):
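The displayed objective is elided in this version of the text; consistently with the description above, it can be rendered (our reconstruction, not the paper's exact display) as the least-squares functional

```latex
L(W^1,\dots,W^N) \;=\; \sum_{i=1}^{p} \bigl( Y_{X_i} - T(X_i) \bigr)^{2},
\qquad (W^1,\dots,W^N) \in \operatorname*{argmin} L .
```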
In this work, we assume that the following holds true:

H1
the LNN has the same number of channels (weight matrix rows) as the number of sample points;

H2
for the first layer, the interaction between neurons i and j is the same as that between neurons j and i; the matrix \(W^1\) has real entries and is symmetric;

H3
the weight matrices \(W^m\) do not depend on the layers: \(W^m=W^1=W\) for \(1\le m \le N\);

H4
the operator T and the vectors X are componentwise positive.
Since the weight matrix is symmetric, the spectral theorem asserts that it is diagonalizable. Therefore, in what follows, we consider independent channels for the LNN (that is to say, a diagonal matrix W) (Fig. 1).
The error estimates for the specific LNN we provide in this paper read, for the 1d case with N layers, when the sample points \(\{ X_i\}_{i=1}^p\) are equidistant, \(h =\frac{X_p-X_1}{p-1}\), \(I=[X_1, X_p]\subset {\mathbb {R}}_+^*,\) and \(T\in H^2(I; {\mathbb {R}}_+^*)\):
The paper is organized as follows. In section one, the one-dimensional case is analyzed. First, the input layer of the NN is specified, and the nearest-neighbor strategy is presented for equidistant sampling points. An estimate in the \(L^2\) norm on the domain of the operator T is made possible because the NN is identified with the temporal transport of the identity towards the operator T: \(t \mapsto T^t, \, 0 \le t \le 1\). The case of irregularly distributed sampling points is then considered. If the nearest-neighbor strategy is adapted, that is to say if an appropriate distance is introduced, the irregularly distributed case reduces to the regularly distributed one. Following the generalized finite-element method introduced by Babuška [1], the same error estimate is obtained without resorting to a change of distance in the nearest-neighbor strategy. In section 2, the d-dimensional case is handled. Building d-simplices with the nearest-neighbor strategy is possible provided that the sampling points have a certain regular distribution. Thanks to a multipoint Taylor formula due to Ciarlet, a pointwise error estimate is available for \({\mathbb {P}}_1\) Lagrange interpolation (see Waldron [14] for a general presentation). Therefore, we propose a local error estimate in the d-dimensional case. Not all the sampling points are used, so adaptive techniques could be developed, but that is beyond the scope of this paper. In the last section, some numerical simulations are given to substantiate the theoretical results.
The 1d Case
Let \(\{ X_i\}_{i=1}^p\) be the equidistant sampling points, \(h =\frac{X_p-X_1}{p-1}, \, X_i = X_1 +(i-1)h, \, 1 \le i \le p\), \(I=[X_1, X_p] \subset {\mathbb {R}}_+^*\), and \(T: I \mapsto {\mathbb {R}}_+^*\) the unknown operator. The LNN we propose has \(N>0\) layers and \(p>0\) independent channels. First, we describe how the input \(X\in I\) contributes to the i-th channel of the LNN as \(\varPhi _i(X)\) for \(1 \le i \le p\). The two-nearest-neighbor strategy is used.
Definition 1
(input contribution) For all \(X \in I\), the components of the p-vector \(\left( \varPhi _1(X), \cdots , \varPhi _p(X) \right) ^t\) are the barycentric coordinates of X with respect to its two nearest neighbors in \(\{ X_i\}_{i=1}^p\), completed by zeros.
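Concretely, for an equidistant grid the two barycentric weights are the classical hat-function values; a sketch (the helper name is our own):

```python
import bisect

def input_contribution(X, grid):
    """p-vector (Phi_1(X), ..., Phi_p(X)): barycentric coordinates of X
    with respect to its two nearest grid neighbours, zeros elsewhere."""
    p = len(grid)
    i = min(max(bisect.bisect_right(grid, X) - 1, 0), p - 2)  # interval [X_i, X_{i+1}]
    h = grid[i + 1] - grid[i]
    phi = [0.0] * p
    phi[i] = (grid[i + 1] - X) / h      # weight of the left neighbour
    phi[i + 1] = (X - grid[i]) / h      # weight of the right neighbour
    return phi

grid = [1.0, 1.5, 2.0, 2.5, 3.0]
print(input_contribution(1.25, grid))  # [0.5, 0.5, 0.0, 0.0, 0.0]
```

Note that the weights sum to one for every \(X\in I\) (partition of unity), which is the property used in Lemma 1.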
For the output layer, we have the following.
Definition 2
(LNN output) Let \(Y_X^N\) be the output of the last layer N of the LNN for the input X. \(Y_X\) is the sum of the components of \(Y_X^N\): \(Y_X =\sum _{i=1}^p (Y^N_{X})_i\)
Let \(\varDelta t =\dfrac{1}{N}\); the LNN channels are defined by the following.
Definition 3
(LNN channel) Let \(\varPhi _i(X)\) be given; for \(1 \le i \le p\), the LNN i-channel is defined by:
A straightforward calculation yields [4]:
Lemma 1
(LNN input) Let \(I \subset {\mathbb {R}}_+^*\) be a compact subset and \(\{ X_i\}_{i=1}^p\) the equidistant sampling points in I with \(h=\frac{X_{p} - X_{1}}{p-1}\). Then the functions \(\varPhi _i(\cdot )\) are the \({\mathbb {P}}_1\)-Lagrange interpolation basis functions associated with the grid \(\{ X_i\}_{i=1}^p\).
Since the functions \(\varPhi _i(\cdot )\) are nonnegative, the i-channel definition reads:
From (4), we have:
and gathering the contributions of all channels provides:
To get the final expression for \(Y_X\), we have to determine the coefficients \(a_i\) by minimizing the quadratic error with respect to the learning examples \(\{ X_i, T(X_i) \}_{i=1}^p\):
Lemma 2
The minimum of the quadratic error L(a) with respect to the learning examples is reached, asymptotically with respect to N, for:
Proof
We have to solve for \(1 \le l \le p\):
Thanks to the property of \({\mathbb {P}}_1\)Lagrange interpolant functions:
we get for \(1 \le l \le p\):
Using an asymptotic expansion at zero for the function \(z \mapsto \ln \left( (1+ z)^+\right) =\ln (1+ z) + \ln (sgn^+(1+ z))\), we get:
where \(sgn^+(z)=\left\{ \begin{array}{lr} 1 &{} \text { if } 0 \le z \\ 0 &{} \text { otherwise } \end{array} \right.\) \(\square\)
The expression (6) of the LNN output becomes:
Now, we give the main result of this section.
Theorem 1
Assume that \(I\subset {\mathbb {R}}_+^*\) is compact, \(T\in H^2(I; {\mathbb {R}}_+^*)\), \(\{ X_i\}_{i=1}^p\) are the ordered sampling points in I with \(h=\frac{\vert X_{p} - X_{1}\vert }{ p-1}\), \(N_0= \sup _{1\le i \le p}\vert \ln {T(X_i)}\vert\), and \(Y_{(\cdot )}\) is the output of the LNN (see (11)). Then, the following asymptotic error estimate holds true for \(N \ge N_0\):
Proof
Let \(\varPi _h\) denote the \({\mathbb {P}}_1\)Lagrange interpolant associated with the grid \(\{ X_i\}_{i=1}^p\) for continuous functions defined on I:
From interpolation results in Sobolev spaces, we have (see [4], Theorem 3.1.4, p. 121):
For \(N \ge N_0\) and N large enough, using an asymptotic expansion at zero for the function \(\ln (1+ z)\), we get from (11):
Thus:
and
The triangle inequality gives the result. \(\square\)
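To make Theorem 1 concrete, here is a sketch of the network output in the closed form suggested by Lemma 2, \(Y_X=\sum _i \varPhi _i(X)\,(1+\ln T(X_i)/N)^N\) (an assumption on our part, since the display (11) is elided here); with \(T(x)=2+\sin x\) on [1, 3], the observed error is of order \(h^2 + 1/N\):

```python
import bisect, math

def lnn_output(X, grid, T, N):
    """Channel i carries Phi_i(X) and is multiplied by (1 + ln T(X_i)/N)
    at each of the N layers; the output layer sums the channels."""
    p = len(grid)
    i = min(max(bisect.bisect_right(grid, X) - 1, 0), p - 2)
    h = grid[i + 1] - grid[i]
    out = 0.0
    for k, w in ((i, (grid[i + 1] - X) / h), (i + 1, (X - grid[i]) / h)):
        out += w * (1.0 + math.log(T(grid[k])) / N) ** N  # -> w * T(X_k)
    return out

T = lambda x: 2.0 + math.sin(x)            # positive and smooth on [1, 3]
grid = [1.0 + 0.1 * k for k in range(21)]  # equidistant, h = 0.1
# worst error at the interval midpoints, with N large enough to hide the 1/N term
err = max(abs(lnn_output(x, grid, T, 10 ** 6) - T(x))
          for x in [1.05 + 0.1 * k for k in range(20)])
```

Here the error is dominated by the \({\mathbb {P}}_1\) interpolation term \(\approx h^2\vert T''\vert /8 \approx 1.25\times 10^{-3}\), matching the \(h^2\) rate of the theorem.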
Remark 1
The proposed LNN is a realization of the time-dependent transport of the Id operator towards the operator T defined by \(t \mapsto T^t\) for \(0\le t \le 1\), or equivalently: \(Y'(t)=\ln {(T)}\, Y(t), \, 0< t \le 1; \; Y(0) =Id.\)
The \(1d\) Case with Non-equidistant Sampling Points and the Two-Neighbor Signed-Distance Strategy
Let \(I \subset {\mathbb {R}}_+^*\) be a compact subset, \(\{ X_i\}_{i=1}^p\) the sampling points in I, and \(h= \sup _{2\le i \le p} \vert X_{i} - X_{i-1}\vert\). For \(2\le i \le p-1\), let us denote by:
the extended global Lagrange basis functions of degree one associated with the grid \(\{ X_i\}_{i=1}^p\). Define the signed distance \(d_S\) to \(\{ X_i\}_{i=1}^p\) by:
The two nearest neighbors of \(X \in I\) for the signed distance are the two end points of the interval \([X_{i-1}, X_{i}]\) to which X belongs.
Definition 4
(input contribution for an irregular grid) For all \(X \in I\), the components of the p-vector \(\left( \varPhi _1(X), \ldots , \varPhi _p(X) \right) ^t\) are the barycentric coordinates of X with respect to its two nearest neighbors for the signed distance in \(\{ X_i\}_{i=1}^p\), completed by zeros.
For all \(X \in I\), the pvector components \(\left( \varPhi _1(X), \ldots , \varPhi _p(X) \right) ^t\) are the values of the \({\mathbb {P}}_1\) Lagrange basis functions.
Using the same LNN and arguing in the same way as in the case of equidistant sampling points, we get the same convergence result as in Theorem 1.
The \(1d\) Case with Non-equidistant Sampling Points, Localized Generalized Finite Elements, and the Two-Nearest-Neighbor Strategy
Here, we follow the generalized finite-element method introduced by Babuška et al. in [1]. Let \(I \subset {\mathbb {R}}_+^*\) be a compact subset, \(\{ X_i\}_{i=1}^p\) the sampling points in I, and \(h= \sup _{2\le i \le p} \vert X_{i} - X_{i-1}\vert\). Let us denote by \(\varPhi _i\) the global basis functions of the \({\mathbb {P}}_1\)-Lagrange interpolant associated with the sampling points \(\{ X_i\}_{i=1}^p\). Set \(\omega _i={Supp} (\varPhi _i)\), the support of \(\varPhi _i\), for \(1 \le i \le p\); these are patches satisfying the following properties:

1.
\(I =\cup _{i=1}^p \omega _i\); \(\varPhi _i=0\) on \(I\setminus \omega _i\);

2.
\(\sum _{i=1}^p \varPhi _i(x)=1\); \(\forall x\in I\);

3.
the functions are uniformly bounded: \(\underset{x\in I}{\max }\vert \varPhi _i(x) \vert \le C_1\) with \(C_1=1\), for \(1 \le i \le p.\)
This partition of unity on I has piecewise continuous derivatives. The generalized finite element is defined by the following:
Definition 5
(Generalized finite element) The geometric support of the i-th finite element is \(Supp(\varPhi _i)\), and the local polynomial space is \({\mathbb {P}}_1\), whose basis \(\{\varphi _1^i, \varphi _2^i\}\) consists of the barycentric coordinates with respect to the two nearest neighbors in \(\{ X_k\}_{k=i-1}^{i+1} \subset Supp(\varPhi _i)\), for \(1\le i \le p\) (Fig. 2).
The global approximation space \(V_h\) is given by:
We have:
Lemma 3
For each \(\omega _i\), let us denote by \(\varPi _{i} \in {{{\mathcal {L}}}} (H^2(\omega _i); L^2(\omega _i))\) the \({\mathbb {P}}_1\) interpolation operator defined, for all \(x \in \omega _i\), by \(\varPi _{i} v(x) = \sum _{k=1}^2 \varphi _k^i(x) v(X_{i_k}),\) with \(X_{i_1}, X_{i_2}\) the two nearest neighbors in \(\{ X_k\}_{k=i-1}^{i+1}\). Assume that there exists \(M>0\) such that \(\frac{h}{\inf _{1 \le i \le p-1}\vert X_{i+1}- X_i \vert } \le M\). Then, there exists \(C>0\) such that for all \(v \in H^2(\omega _i)\), we have the following estimate:
where \(\vert \cdot \vert _{H^2(\omega _i)}\) is the \(H^2(\omega _i)\) seminorm.
Proof
Using the technique of returning to a reference interval \(K_i= [ \frac{X_{i-1}-X_i}{X_{i+1}-X_i} , 1]\) through the affine transformation \({{\hat{F}}}({{\hat{x}}})= X_i + {{\hat{x}}} (X_{i+1}-X_i)\), we set \({\hat{v}}=v\circ {\hat{F}}\) and define an interpolation operator \(\hat{\varPi }_i\) such that \({{\hat{\varPi }}}_i{\hat{v}}=\varPi _iv\circ {\hat{F}}\). Using the change of variable \({\hat{F}}\), we have:
Since \({\hat{v}}- {{\hat{\varPi }}}_i {\hat{v}}\) vanishes for \({\hat{x}} \in \{0, 1\}\), using Rolle’s theorem and integrating twice, we get:
where \(\vert K_i \vert\) denotes the length of \(K_i\), which is controlled uniformly with respect to the index i thanks to the hypothesis on the mesh. Now, we use the change of variable \({\hat{F}}^{-1}\) to get a bound with the Sobolev \(H^2\) seminorm \(\vert \cdot \vert _{H^2(\omega _i)}\), which introduces a \(\frac{h^4}{\vert X_{i+1}-X_i\vert }\) factor in the previous inequality, and thus we get the announced result. \(\square\)
For the global interpolation operator, we have:
Lemma 4
Let \(\varPi _h : H^2(I) \longrightarrow V_h \subset L^2(I)\) be defined by:
where \(\varPi _i\) is defined in Lemma 3. Assume that there exists \(M>0\) such that \(\frac{h}{\inf _{1 \le i \le p-1}\vert X_{i+1}- X_i \vert } \le M\). Then, there exists \(C_2>0\) such that for all \(v \in H^2(I)\), we have the following estimate:
Proof
Using the fact that any \(x\in I\) belongs to at most two subdomains \(\omega _i\), we see that \(\sum _{i=1}^p \varPhi _i (v- \varPi _{i} v)\) has at most two nonzero terms for any \(x \in I\). Hence, the Cauchy–Schwarz inequality provides:
Then, we have:
which with Lemma 3 gives:
\(\square\)
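Since the displayed inequalities of this proof are elided in this version, the chain of estimates can be summarized as follows (our reconstruction, assuming the local estimate of Lemma 3 in the form \(\Vert v-\varPi _i v\Vert _{L^2(\omega _i)}\le Ch^2\vert v\vert _{H^2(\omega _i)}\)):

```latex
\Vert v-\varPi_h v\Vert_{L^2(I)}^2
  =\int_I\Big|\sum_{i=1}^p \varPhi_i\,(v-\varPi_i v)\Big|^2\,dx
  \le 2\sum_{i=1}^p\int_{\omega_i}\vert\varPhi_i\vert^2\,\vert v-\varPi_i v\vert^2\,dx
  \le 2\,C^2 h^4 \sum_{i=1}^p \vert v\vert_{H^2(\omega_i)}^2
  \le 4\,C^2 h^4\,\vert v\vert_{H^2(I)}^2 ,
```

where the first factor 2 comes from the Cauchy–Schwarz inequality applied to the two-term sum, \(\vert \varPhi _i\vert \le 1\) is used, and the last factor 2 accounts for each x being counted in at most two patches \(\omega _i\).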
To account for the irregularity of the grid \(\{ X_i\}_{i=1}^p\), we have to modify the LNN as follows:
Definition 6
(input contribution, irregular case) For all \(X \in I\), the components of the p-vector \(\left( \psi _1(X), \cdots , \psi _p(X) \right) ^t\) are defined by:
where the \(\varphi ^i_k(X)\) are the barycentric coordinates of X with respect to the two nearest neighbors in \(\{ X_k\}_{k=i-1}^{i+1} \subset Supp(\varPhi _i)\), for \(1\le i \le p\).
Unfortunately, the functions \(\psi _i\) are not always positive, and thus we modify our algorithm.
Definition 7
(LNN channel) Let \(\psi _i(X)\) be given; for \(1 \le i \le p\), the LNN i-channel is defined by:
From (14), we have:
and gathering the contributions of all channels provides:
Since \(\psi _l(X_q)= \left\{ \begin{array}{lr} 1 &{} \text { if } l=q \\ 0 &{} \text { otherwise } \end{array} \right.\) for \(1 \le l,q \le p\), Lemma 2 applies and we get \(a_i = \ln T(X_i) + \mathop {}\mathopen {}{\mathcal {O}}\mathopen {}\left( \dfrac{1}{N}\right)\) for \(1\le i\le p\), and finally:
Theorem 2
Assume that \(I\subset {\mathbb {R}}_+^*\) is compact, \(T\in H^2(I; {\mathbb {R}}_+^*)\), \(\{ X_i\}_{i=1}^p\) are irregularly distributed sampling points in I with \(h=\sup _{2 \le i \le p} \vert X_{i} - X_{i-1}\vert\), and there exists \(M>0\) such that \(\frac{h}{\inf _{1 \le i \le p-1}\vert X_{i+1}- X_i \vert } \le M\). Let \(N_0= \sup _{1\le i \le p}\vert \ln {T(X_i)}\vert\) and \(Y_{(\cdot )}\) be the output of the LNN (see (17)). Then, the following asymptotic error estimate holds true for \(N \ge N_0\):
Proof
Let \(\varPi _h\) denote the interpolation operator defined in Lemma 4 associated with the grid \(\{ X_i\}_{i=1}^p\) for continuous functions defined on I:
From Lemma 4, we have:
For \(N \ge N_0\) and N large enough, from (16), we have:
The estimate for \(I- \varPi _h\) reads:
The triangle inequality yields:
which gives the result.
\(\square\)
The Multidimensional Case
Let \(\varOmega \subset ({\mathbb {R}}_+^*)^d\) be a bounded Lipschitz open subset, located at a distance \(h_0>0\) from the boundary of \(({\mathbb {R}}_+^*)^d\) and containing the convex hull of the sampling points \(\{ X_i\}_{i=1}^p\). The domain of the operator T is \(({\mathbb {R}}_+^*)^d\), and T is assumed to be \(C^2\) with bounded second derivatives.
In this section, it is assumed that the following hypothesis holds true:

H5 there exists \(0<h \le h_0\) such that, for all \(X \in \varOmega\), there exist \(d+1\) nearest neighbors of X among \(\{ X_i\}_{i=1}^p\) at a distance at most h (i.e., belonging to the ball B(X, h)) which constitute a non-degenerate d-simplex \(K_X\). Moreover, denoting by \({{\underline{\rho }}}>0\) the radius of the largest ball included in \(K_X\), the ratio \(\frac{h}{ {{\underline{\rho }}}}\) is bounded uniformly in \(X \in \varOmega\).
For the multidimensional case, we argue locally. For each fixed \(X \in \varOmega\), we introduce a local interpolation space.
Definition 8
(local interpolation space) Define \({\mathbb {P}}_1 (B(X, h)) =Span \{\varphi _j \}_{j=1}^{d+1}\), where \(\varphi _j\) denotes the \({\mathbb {P}}_1\)-Lagrange function associated with the j-th vertex \(A_j \in \{ X_i\}_{i=1}^p\) of the d-simplex \(K_X\). For all \(f \in C^0({\overline{B}}(X, h))\), we have \(\varPi _h f(\cdot ) = \sum _{j=1}^{d+1}\varphi _j(\cdot ) f(A_j)\).
Now, we give the error estimate obtained with a multipoint Taylor formula (see [4], p. 128, or [14], Section 6).
Lemma 5
Assume \(u \in C^2({{\overline{\varOmega }}})\). Then there exists \(C>0\) such that, for all \(X \in \varOmega\) and all \(x \in B(X, h)\), we have the following estimate:
The hypothesis that \(\frac{h}{ {{\underline{\rho }}}}\) remains bounded is used to obtain a uniform bound for the \({\mathbb {P}}_1\)-Lagrange functions on B(X, h). Let us give the input contribution to the LNN for the d-dimensional case.
Definition 9
(input contribution in d dimensions) For all \(X \in \varOmega\), the components of the p-vector \(\left( \varPhi _1(X), \cdots , \varPhi _p(X) \right) ^t\) are the barycentric coordinates of X relative to \(K_X\), the d-simplex of nearest neighbors of X in \(\{ X_i\}_{i=1}^p\), completed by zeros.
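The barycentric coordinates of X relative to \(K_X\) solve a small linear system; a self-contained sketch (plain Gaussian elimination, the helper name is ours), which also shows that coordinates can be negative when X lies outside the simplex:

```python
def barycentric_coordinates(X, vertices):
    """Barycentric coordinates lam of point X (length d) with respect to a
    d-simplex given by its d+1 vertices: solve sum_j lam_j A_j = X together
    with sum_j lam_j = 1. A non-degenerate simplex makes the system regular."""
    d = len(X)
    n = d + 1
    # Rows: the d coordinate equations, then the sum-to-one constraint.
    A = [[vertices[j][r] for j in range(n)] for r in range(d)] + [[1.0] * n]
    b = list(X) + [1.0]
    # Gaussian elimination with partial pivoting.
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    lam = [0.0] * n
    for r in range(n - 1, -1, -1):
        lam[r] = (b[r] - sum(A[r][k] * lam[k] for k in range(r + 1, n))) / A[r][r]
    return lam

tri = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(barycentric_coordinates([0.25, 0.25], tri))  # inside:  [0.5, 0.25, 0.25]
print(barycentric_coordinates([1.0, 1.0], tri))    # outside: one coordinate < 0
```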
Not all the functions \(\varPhi _i(X)\) are nonnegative, since we do not necessarily have \(X \in K_X\), and thus the LNN is given by the following:
Definition 10
Let \(\varPhi _i(X)\) be given; for \(1 \le i \le p\), the LNN i-channel is defined by:
From (21), we have:
and gathering the contributions of all channels provides:
Since \(\varPhi _l(X_q)= \left\{ \begin{array}{lr} 1 &{} \text { if } l=q \\ 0 &{} \text { otherwise } \end{array} \right.\) for \(1 \le l,q \le p\) such that \(X_l, X_q\) are vertices of \(K_X\), Lemma 2 applies, and we get \(a_i = \ln T(X_i) + \mathop {}\mathopen {}{\mathcal {O}}\mathopen {}\left( \dfrac{1}{N}\right)\) for \(1\le i\le p\) such that \(X_i\) is a vertex of \(K_X\), and finally:
Thanks to Lemma 5 and the triangle inequality, we have the following error estimate:
Theorem 3
The operator \(T: ({\mathbb {R}}_+^*)^d \longrightarrow {\mathbb {R}}\) is assumed to be \(C^2\) with bounded second derivatives. Let \(N_0= \sup _{1\le i \le p}\vert \ln {T(X_i)}\vert\) and \(Y_{(\cdot )}\) be the output of the LNN (see (24)). Then, there exists \(C_d>0\) such that the following asymptotic error estimate holds true for \(N \ge N_0\):
Proof
For \(N \ge N_0\) and N large enough, from (23), we have:
The estimate for \((\varPi _h-I)T\) reads:
The triangle inequality yields:
\(\square\)
Remark 2
Introduce the covering metric radius function:
and then a global interpolation operator can be built as in the one-dimensional case, using a partition of unity, if we assume that:

let \(h=R(r, \varOmega )\); for all \(x \in \varOmega\), there exist at most \(\chi\) indices \(i\in \{1, \cdots , r \}\) such that the balls \(B(x_i, h) \ni x\);

concerning the sampling points \(\{ X_i\}_{i=1}^p\): for all \(i \in \{1, \cdots , r \}\), the \(d+1\) nearest neighbors of \(x_i\) among \(\{ X_i\}_{i=1}^p\) located at a distance at most h from \(x_i\) constitute a non-degenerate d-simplex \(K_i\). Moreover, denoting by \({{\underline{\rho }}}>0\) the radius of the largest ball included in \(K_i\), \(i\in \{1, \cdots , r \}\), the ratio \(\frac{h}{ {{\underline{\rho }}}}\) is bounded.
For all \(1 \le i \le r\), define \({\mathbb {P}}_1 (B(x_i, h)) =Span \{\varphi _j^i \}_{j=1}^{d+1}\), where \(\varphi _j^i\) denotes the \({\mathbb {P}}_1\)-Lagrange function associated with the j-th vertex \(A_j \in \{ X_i\}_{i=1}^p\) of the d-simplex \(K_i \subset B(x_i, h)\). For all \(f \in C^0({\overline{B}}(x_i, h))\), we have \(\varPi _h^i f(\cdot ) = \sum _{j=1}^{d+1}\varphi _j^i(\cdot ) f(A_j)\). The local interpolation space is \(V_h=\prod _{i=1}^r {\mathbb {P}}_1 (B(x_i, h))\).
Remark 3
The given error estimates are not optimal; using the averaged Taylor polynomial, it is possible to get estimates in Sobolev norms.
Some Numerical Results
In this section, some numerical simulations are given to substantiate the theoretical results. For the 1d case, let \(T: [1,3] \rightarrow {\mathbb {R}}\) be defined by \(T(x) =3 x^2\). Table 1 gives the \(L^1\) error computed with Simpson numerical integration for different numbers of sampling points and for \(N=10^4\) layers of neurons.
The order of convergence is \(h^2\) as asserted by the theoretical results.
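This 1d experiment can be reproduced approximately in a few lines; here we compute the \(L^1\) distance between \(T(x)=3x^2\) and its \({\mathbb {P}}_1\) interpolant on [1, 3] (the N-independent part of the error), using a simple midpoint quadrature instead of Simpson's rule; halving h divides the error by about 4:

```python
T = lambda x: 3 * x * x

def interp_error_L1(p, samples=2000):
    """L1 distance between T and its P1 interpolant on p equidistant points
    of [1, 3], approximated by midpoint quadrature on a fine grid."""
    a, b = 1.0, 3.0
    h = (b - a) / (p - 1)
    grid = [a + k * h for k in range(p)]
    dx = (b - a) / samples
    total = 0.0
    for s in range(samples):
        x = a + (s + 0.5) * dx
        i = min(int((x - a) / h), p - 2)          # subinterval containing x
        lam = (x - grid[i]) / h
        interp = (1 - lam) * T(grid[i]) + lam * T(grid[i + 1])
        total += abs(interp - T(x)) * dx
    return total

e1, e2 = interp_error_L1(11), interp_error_L1(21)
print(e1, e2, e1 / e2)  # halving h divides the L1 error by about 4 (order h^2)
```

For a quadratic T, the interpolation error on each subinterval integrates exactly to \(h^3/2\), so the total \(L^1\) error equals \(h^2\) here, giving the \(h^2\) rate of Table 1.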
For the case d=800, let T be defined by \(T(x) =\sum _{k=1}^{800} x_k^{ (2 + (-1)^k)}\). We consider four cases constituted of 4000 learning examples randomly distributed on a hypercube centered at \(X=(1.1, \cdots , 1.1)^t\), the step of which is, respectively, 0.05, 0.1, 0.2, and 0.4. The error is given at the point X and at \(X_{K_X}\), the barycentre of the simplex \(K_X\), for \(N=10^5\) layers of neurons (Tables 2 and 3).
Let us remark that, for all computations in dimension \(d=800\), the point \(X\notin K_X\). For the barycenter of the simplex \(K_X\), it can be checked that the order of convergence is \(h^2\) for both numbers of sampling points. At the point X, the order of convergence deteriorates as h becomes too small. This comes from the fact that we have very little control over the similarity between \(K_X\) and the ball \(B_X\). It can be checked that if the maximum of the barycentric coordinates is too big compared with the barycentric coordinates of the barycenter \(X_{K_X}\), then the approximation deteriorates.
Conclusion
By relating a particular LNN to a time-dependent transport problem, we provide an error estimate for this LNN using the \({\mathbb {P}}_1\) Lagrange interpolant. Space and time are shown to play different roles in the error estimate. Thanks to the generalized finite-element method originally presented by I. Babuška, a nearest-neighbor strategy is proved to be efficient for handling the spatial accuracy in one dimension as well as in several dimensions. Moreover, the nearest-neighbor strategy is local and allows one to reduce the computational expense of the proposed method.
In d dimensions, the error estimate presented in Theorem 3 is not optimal; using the averaged Taylor polynomial, it is possible to get better estimates in Sobolev norms.
References
 1.
Babuška I, Banerjee U, Osborn J. Generalized finite element methods: main ideas, results and perspective. Int J Comput Methods. 2004;1(1):67–103.
 2.
Cao F, Xie T, Xu Z. The estimate for approximation error of neural networks: a constructive approach. Neurocomputing. 2008;71:626–30.
 3.
Chang B, Meng L, Haber E, Tung F, Begert D. Multi-level residual networks from dynamical systems view. arXiv:1710.10348; 2017.
 4.
Ciarlet PG. The finite element method for elliptic problems. Studies in mathematics and its applications. Amsterdam: North-Holland; 1976.
 5.
Chen TQ, Rubanova Y, Bettencourt J, Duvenaud DK. Neural ordinary differential equations. In: Advances in Neural Information Processing Systems; 2018. p. 6571–83.
 6.
Cuchiero C, Larsson M, Teichmann J. Deep neural networks, generic universal interpolation, and controlled ODEs. arXiv:1908.07838 [math.OC]; 2019.
 7.
Filici C. Error estimation in the neural network solution of ordinary differential equations. Neural Netw. 2010;23:614–7.
 8.
Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016. http://www.deeplearningbook.org.
 9.
Guliyev NJ, Ismailov VE. On the approximation by single hidden layer feedforward neural networks with fixed weights. Neural Netw. 2018;98:296–304.
 10.
He J, Li L, Xu J, Zheng C. ReLU deep neural networks and linear finite elements. arXiv:1807.03973; 2018.
 11.
Hornik K. Approximation capabilities of multilayer feedforward networks. Neural Netw. 1991;4(2):251–7.
 12.
Larger L, Baylón-Fuentes A, Martinenghi R, Udaltsov VS, Chembo YK, Jacquot M. High-speed photonic reservoir computing using a time-delay-based architecture: million words per second classification. Phys Rev X. 2017;7:011015.
 13.
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44.
 14.
Waldron S. Multipoint Taylor formulae. Numer Math. 1998;80:461–94.
Ethics declarations
Conflict of Interest:
The authors declare that they have no conflict of interest.
Cite this article
Benmansour, K., Pousin, J. Nearest Neighbors Strategy, \({\mathbb {P}}_1\) Lagrange Interpolation, and Error Estimates for a Learning Neural Network. SN COMPUT. SCI. 2, 38 (2021). https://doi.org/10.1007/s42979-020-00409-3
Keywords
 Learning neural network
 Transport operator
 \({\mathbb {P}}_1\) Lagrange interpolation
 Error estimates
 Nearestneighbor strategy
Mathematics Subject Classification
 65L09
 65D05
 65G99