Nearest Neighbors Strategy, \({\mathbb {P}}_1\) Lagrange Interpolation, and Error Estimates for a Learning Neural Network

Abstract

Approximating a function with a learning neural network (LNN) has long been studied by many authors and is known as “the universal approximation property”. The higher the required precision, the more neurons in the hidden layer one has to take to reach it. Another challenge for learning neural networks is understanding how to reduce the computational expense and how to select the training examples. Our contribution here is to consider an LNN as a discretization of the time-dependent transport of the identity function, Id, towards the function T to be approximated by the LNN. Using classical interpolation properties of \({\mathbb {P}}_1\) Lagrange functions, we are able to give space-time error estimates for the simple LNN we introduce.

Introduction

Artificial neural networks have become very popular in computer science in recent years, particularly for image and signal processing. These methods aim to approximate functions with layers of neurons (or units), connected by linear operations between units and non-linear activations within units; see [8, 9, 13] and references therein. The input \(Y^k\) of a given layer k is linked to the output \(Y ^{k+1}\) of this layer through a transformation which can be expressed by \(Y ^{k+1} = Y^k + F^k(Y^k, \theta ^k)\), where \(\theta ^k\) is a parameter vector. This representation of an LNN is a one-step discretization of the flow \(Y^t :\) \(\frac{d}{dt} Y^t = F(Y^t,\theta _t)\) for \(t_k<t<t_{k+1}\) with \(Y^{t_k}=Y^k\). Several studies have considered the idea of viewing such networks as discretizations of ordinary differential equations (ODEs) or of controlled ODEs, see, for example, [3, 5, 6, 7] among others, or as time-delay differential equations [12]. In this study, we consider the proposed LNN as a discretization of the time-dependent transport of the Id function towards the function T to be approximated by the LNN, which reads: \(\frac{d}{dt} Y^t = \ln {(T)}\, Y^t\) for \(0<t<1\) with \(Y^0=Id\).

Results on approximating a function (and its derivatives) with a neural network were given, for example, in a version of the universal approximation theorem in [11] (Theorem 3), and more recently in [2].

In [10], it is proved that ReLU deep neural networks are able to represent the basis functions of simplicial finite elements of order one, thus allowing the approximation of second-order PDE problems which can be formulated as the minimization of an energy functional.

In this paper, we aim to make progress towards a theoretical understanding of the approximation accuracy of a simple learning neural network (LNN) using the classical tools of numerical analysis. The starting point of an LNN is a dataset of input-output pairs \(\{ X_i, T(X_i) \}_{i=1}^p\), called sampling points or learning examples, where T is an unknown operator whose values are only known at the sample points \(X_i\). As an example, \(X_i\) can be thought of as an image and \(T(X_i)\) as a transformed image.

Images are represented by vectors (the pixel columns of an image are stacked into a vector). Thus, the unknown operator T is handled component-wise, and therefore, in what follows, T will denote the unknown operator with values in \({\mathbb {R}}\). A feedforward architecture is commonly used for neural networks (NNs); it implements the operator T as a sequence of linear transformations, each followed by a component-wise application of a non-linear function (the positive part function \(x \mapsto x^+\)), called the activation function. The length N of the sequence is referred to as the number of layers of the network, and the input of a layer m, \(1 \le m \le N\), consists of the output of the preceding layer \(Y^{m-1} \in {\mathbb {R}}^p\), where p is the number of neurons of layer m; the output of this layer is \(Y^{m} =\left( W^mY^{m-1} \right) ^+\) with \(W^m\) the weight matrix in \({\mathbb {M}}_{p\times p}\). Such an NN is said to have p channels. The input layer of the LNN has to be specified (how to distribute the input X to every channel), and the last one consists of adding the contribution of each channel. For a given input X, the output of the LNN is denoted \(Y_X\), and the matrices \(W^m\) are determined by minimizing the quadratic error to the learning examples \(\{ X_i, T(X_i) \}_{i=1}^p\):

$$\begin{aligned} \underset{ W^m 1 \le m\le N }{\text {Argmin}} \frac{1}{2p} \sum _{i=1}^p\left( Y_{X_i} -T(X_i) \right) ^2 . \end{aligned}$$
(1)
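To fix ideas, the forward pass \(Y^{m} =\left( W^mY^{m-1} \right) ^+\) and the quadratic error (1) can be transcribed as follows (a minimal NumPy sketch for illustration only, not the authors' code; the names `relu`, `forward`, and `quadratic_error` are ours):

```python
import numpy as np

def relu(z):
    # component-wise positive part x -> x^+
    return np.maximum(z, 0.0)

def forward(y0, weights):
    """Feedforward pass: y0 is the input X distributed over the p channels,
    weights is the list of the N weight matrices W^m (each p x p).
    The scalar output Y_X is the sum of the components of the last layer."""
    y = np.asarray(y0, dtype=float)
    for W in weights:
        y = relu(W @ y)
    return y.sum()

def quadratic_error(weights, channel_inputs, targets):
    """Quadratic error (1) over the p learning examples {X_i, T(X_i)}."""
    p = len(targets)
    return sum((forward(y0, weights) - t) ** 2
               for y0, t in zip(channel_inputs, targets)) / (2.0 * p)
```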

In this work, we assume that the following holds true:

  1. H1: the LNN has the same number of channels (weight matrix rows) as the number of sample points;

  2. H2: for the first layer, the interaction between neurons i and j is the same as the one between neurons j and i; the matrix \(W^1\) has real entries and is symmetric;

  3. H3: the weight matrices \(W^m\) do not depend on the layer: \(W^m=W^1=W\) for \(1\le m \le N\);

  4. H4: the operator T and the vectors X are positive (component wise).

Since the weight matrix is symmetric, the spectral theorem says that it is diagonalizable. Therefore, in what follows, we consider independent channels for the LNN (that is to say, a diagonal matrix W); see Fig. 1.

Fig. 1: LNN

The error estimates we provide in this paper for our specific LNN read as follows for the 1d case with N layers, when the sample points \(\{ X_i\}_{i=1}^p\) are equidistant, \(h =\frac{X_p-X_1}{p-1}\), \(I=[X_1, X_p]\subset {\mathbb {R}}_+^*\), and \(T\in H^2(I; {\mathbb {R}}_+^*)\):

$$\begin{aligned} \Vert T (\cdot ) -Y_{(\cdot )}\Vert _{L^2(I)} \le C\left( h^2 + \frac{1}{N} \right) . \end{aligned}$$
(2)

The paper is organized as follows. In the first section, the one-dimensional case is analyzed. First, the input layer of the NN is specified, and the nearest-neighbor strategy is presented for equidistant sampling points. An estimate in the \(L^2\) norm over the domain of the operator T is made possible because the NN is identified with the temporal transport of the identity towards the operator T: \(t \mapsto T^t, \, 0 \le t \le 1\). The case of irregularly distributed sampling points is then considered. If the nearest-neighbor strategy is adapted, that is to say, if an appropriate distance is introduced, the case of irregularly distributed sampling points reduces to the regularly distributed one. Following the method of generalized finite elements introduced by Babuska [1], and without using the trick of changing the distance in the nearest-neighbor strategy, the same error estimate is obtained. In the second section, the multi-dimensional case is handled. Building a d-simplex with the nearest-neighbor strategy is possible, provided that the sampling points have a sufficiently regular distribution. Thanks to a multi-point Taylor formula due to Ciarlet, a point-wise error estimate is available for \({\mathbb {P}}_1\) Lagrange interpolation (see Waldron [14] for a general presentation). Therefore, we propose a local error estimate in the multi-dimensional case. Not all the sampling points are used, so adaptive techniques could be developed, but that is out of the scope of this paper. In the last section, some numerical simulations are given to substantiate the theoretical results.

The 1d Case

Let \(\{ X_i\}_{i=1}^p\) be the equidistant sampling points, \(h =\frac{X_p-X_1}{p-1}\), \(X_i = X_1 +(i-1)h\), \(1 \le i \le p\), \(I=[X_1, X_p] \subset {\mathbb {R}}_+^*\), and \(T: I \mapsto {\mathbb {R}}_+^*\) the unknown operator. The LNN we propose has \(0<N\) layers and \(0<p\) independent channels. First, we describe how the input \(X\in I\) contributes to the i-channel of the LNN as \(\varPhi _i(X)\) for \(1 \le i \le p\). The 2-nearest-neighbor strategy is used.

Definition 1

(input contribution) For all \(X \in I\), the components of the p-vector \(\left( \varPhi _1(X), \cdots , \varPhi _p(X) \right) ^t\) are the barycentric coordinates of X with respect to its two nearest neighbors in \(\{ X_i\}_{i=1}^p\), completed by zeros.
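A minimal sketch of Definition 1 (illustrative; `input_contribution` is our name): for X in the cell \([X_i, X_{i+1}]\), the two non-zero components are the usual hat-function values. Since the cell width is computed locally, the same evaluation also applies verbatim to the sorted non-equidistant grids of the next subsections.

```python
import numpy as np

def input_contribution(X, grid):
    """Phi(X): barycentric coordinates of X with respect to its two nearest
    neighbours in the sorted grid {X_i}, completed by zeros (Definition 1)."""
    grid = np.asarray(grid, dtype=float)
    p = len(grid)
    phi = np.zeros(p)
    # index i of the cell [X_i, X_{i+1}] containing X (clamped to the grid)
    i = int(np.clip(np.searchsorted(grid, X, side="right") - 1, 0, p - 2))
    h_loc = grid[i + 1] - grid[i]
    phi[i] = (grid[i + 1] - X) / h_loc
    phi[i + 1] = (X - grid[i]) / h_loc
    return phi
```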

For the output layer, we have the following.

Definition 2

(LNN output) Let \(Y_X^N\) be the output of the last layer N of the LNN for the input X. \(Y_X\) is the sum of the components of \(Y_X^N\): \(Y_X =\sum _{i=1}^p Y^N_{i}\).

Let \(\varDelta t =\dfrac{1}{N}\); the LNN channels are defined by the following.

Definition 3

(LNN channel) Let \(\varPhi _i(X)\) be given, and for \(1 \le i \le p\), the LNN i-channel is defined by:

$$\begin{aligned} \left\{ \begin{array}{ll} Y_{i}^{m+1} &{} = \left[ (1 + \varDelta t \, a_i) Y_{i}^{m}\right] ^+\quad 0\le m\le N-1 \\ Y_{i}^{0}&{}= \varPhi _i(X). \end{array} \right. \end{aligned}$$
(3)
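Recursion (3) written out directly (a sketch; `channel_output` is our name):

```python
def channel_output(phi_i, a_i, N):
    """One LNN channel (3): N steps of y <- [(1 + dt * a_i) * y]^+ with dt = 1/N,
    started from y = Phi_i(X)."""
    dt = 1.0 / N
    y = phi_i
    for _ in range(N):
        y = max((1.0 + dt * a_i) * y, 0.0)  # positive part
    return y
```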

A straightforward calculation yields [4]:

Lemma 1

(LNN input) Let \(I \subset {\mathbb {R}}_+^*\) be a compact subset, \(\{ X_i\}_{i=1}^p\) the equidistant sampling points in I with \(h=\frac{X_{1} - X_{p}}{p-1}\). Then the \(\varPhi _i(\cdot )\) functions are the \({\mathbb {P}}_1\)-Lagrange interpolant functions associated with the grid \(\{ X_i\}_{i=1}^p\).

Since the functions \(\varPhi _i(\cdot )\) are non-negative, the i-channel definition reads:

$$\begin{aligned} \left\{ \begin{array}{ll} Y_{i}^{m+1} &{} = \left[ (1 + \varDelta t \, a_i) \right] ^+Y_{i}^{m}\quad 0\le m\le N-1 \\ Y_{i}^{0}&{}= \varPhi _i(X). \end{array} \right. \end{aligned}$$
(4)

We have from 4:

$$\begin{aligned} Y_{i}^{N}= \exp {\left( N \ln {\left[ 1 + \varDelta t \, a_i \right] ^+}\right) } \varPhi _i(X), \end{aligned}$$
(5)

and gathering all the channels' contributions provides:

$$\begin{aligned} Y_{X}= \sum _{i=1}^p\exp {\left( N \ln {\left[ 1 + \varDelta t \, a_i \right] ^+}\right) } \varPhi _i(X). \end{aligned}$$
(6)

To get the final expression for \(Y_X\), we have to determine the \(a_i\) coefficients by minimizing the quadratic error to the learning examples \(\{ X_i, T(X_i) \}_{i=1}^p\):

$$\begin{aligned}&\underset{ a_i 1 \le i \le p }{\text {Argmin}} \, \frac{1}{2p} \sum _{k=1}^p\nonumber \\&\qquad \left( \sum _{i=1}^p\exp {\left( N \ln {\left[ 1 + \varDelta t \, a_i \right] ^+}\right) } \varPhi _i(X_k) -T(X_k) \right) ^2 \nonumber \\&\quad = \underset{ a_i 1 \le i \le p }{\text {Argmin}} \, L(a). \end{aligned}$$
(7)

Lemma 2

The minimum of the quadratic error L(a) to the learning examples is reached, asymptotically with respect to N, for:

$$\begin{aligned} a_i = \ln {T(X_i)} + \mathop {}\mathopen {}{\mathcal {O}}\mathopen {}\left( \dfrac{1}{N}\right) \quad 1 \le i \le p. \end{aligned}$$
(8)

Proof

We have to solve for \(1 \le l \le p\):

$$\begin{aligned}&\frac{\partial }{\partial a_l} L(a)\nonumber \\&\quad = \frac{1}{p}\sum _{k=1}^p \varPhi _l(X_k)\nonumber \\&\qquad \left( \sum _{i=1}^p \exp {\left( N \ln {\left[ 1 + \varDelta t \, a_i \right] ^+}\right) } \varPhi _i(X_k) -T(X_k) \right) \nonumber \\&\qquad \frac{\partial }{\partial a_l} \exp {\left( N \ln {\left[ 1 + \varDelta t \, a_l \right] ^+}\right) } =0. \end{aligned}$$
(9)

Thanks to the property of \({\mathbb {P}}_1\)-Lagrange interpolant functions:

$$\begin{aligned} \varPhi _l(X_q)= \left\{ \begin{array}{lr} 1 &{} \text { if } l=q \\ 0 &{} \text { otherwise }, \end{array} \right. \end{aligned}$$

we get for \(1 \le l \le p\):

$$\begin{aligned} \frac{\partial }{\partial a_l} L(a)=0 \text { if }\exp {\left( N \ln {\left[ 1 + \varDelta t \, a_l\right] ^+}\right) } -T(X_l) =0. \end{aligned}$$
(10)

Using the asymptotic expansion of \(\ln (1+ z)\) at zero and writing \(\ln {[1+ z]^+}=\ln (1+ z) + \ln (\text {sgn}^+(1+ z))\), we get:

$$\begin{aligned} \left( a_l+ \mathop {}\mathopen {}{\mathcal {O}}\mathopen {}\left( \dfrac{1}{N}\right) \right) + N \ln \left( \text {sgn}^+ \left( 1 + \dfrac{a_l}{N}\right) \right) = \ln {T(X_l) }\quad 1 \le l \le p; \end{aligned}$$

where \(sgn^+(z)=\left\{ \begin{array}{lr} 1 &{} \text { if } 0 \le z \\ 0 &{} \text { otherwise } \end{array} \right.\) \(\square\)
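As a quick sanity check of Lemma 2 (the numerical values below are ours and purely illustrative): solving (10) exactly gives \(a_l = N\,(T(X_l)^{1/N} - 1)\), which indeed differs from \(\ln T(X_l)\) by \({\mathcal {O}}(1/N)\).

```python
import numpy as np

T_val, N = 8.67, 10_000                    # hypothetical sample value and depth
a_exact = N * (T_val ** (1.0 / N) - 1.0)   # exact root of (1 + a/N)^N = T_val
print(a_exact - np.log(T_val))             # ~ (ln T_val)^2 / (2N), i.e. O(1/N)
```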

The expression of the LNN output 6 becomes:

$$\begin{aligned} Y_{X}= \sum _{i=1}^p\exp {\left( N \ln {\left[ 1 + \varDelta t \left\{ \ln {(T(X_i))} + \mathop {}\mathopen {}{\mathcal {O}}\mathopen {}\left( \dfrac{1}{N}\right) \right\} \right] ^+}\right) } \varPhi _i(X). \end{aligned}$$
(11)
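Putting the pieces together, (11) with \(a_i = \ln T(X_i)\) gives the following closed-form evaluation of the 1d LNN (a sketch reusing `input_contribution` from the earlier snippet; the function name and the test values are ours):

```python
import numpy as np

def lnn_output(X, grid, T_values, N):
    """Output (11) of the 1d LNN with a_i = ln T(X_i)."""
    dt = 1.0 / N
    a = np.log(np.asarray(T_values, dtype=float))
    # growth factor exp(N ln[1 + dt a_i]^+); the tiny floor realizes the positive part
    growth = np.exp(N * np.log(np.maximum(1.0 + dt * a, 1e-300)))
    return float(np.dot(growth, input_contribution(X, grid)))

# example: T(x) = 3 x^2 on [1, 3], 21 equidistant sampling points, N = 10^4 layers
grid = np.linspace(1.0, 3.0, 21)
Y = lnn_output(1.73, grid, 3.0 * grid ** 2, N=10_000)
print(Y, 3.0 * 1.73 ** 2)   # the two values agree up to O(h^2 + 1/N)
```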

Now, we give the main result of this section.

Theorem 1

Assume that \(I\subset {\mathbb {R}}_+^*\) is compact, \(T\in H^2(I; {\mathbb {R}}_+^*)\), \(\{ X_i\}_{i=1}^p\) are the equidistant sampling points in I with \(h=\frac{\vert X_{p} - X_{1}\vert }{ p-1}\), \(N_0= \sup _{1\le i \le p}\vert \ln {T(X_i)}\vert\), and \(Y_{(\cdot )}\) is the output of the LNN (see 11). Then, the following asymptotic error estimate holds true for \(N_0 \le N\):

$$\begin{aligned} \Vert T (\cdot ) -Y_{(\cdot )}\Vert _{L^2(I)} \le C\left( h^2 + \frac{1}{N} \right) . \end{aligned}$$
(12)

Proof

Let \(\varPi _h\) denote the \({\mathbb {P}}_1\)-Lagrange interpolant associated with the grid \(\{ X_i\}_{i=1}^p\) for continuous functions defined on I:

$$\begin{aligned} \varPi _h v(z) = \sum _{i=1}^p v(X_i) \varPhi _i(z) \quad \forall z \in I. \end{aligned}$$

From interpolation results in Sobolev spaces, we have (see [4], Theorem 3.1.4, p. 121):

$$\begin{aligned} \Vert v (\cdot ) -\varPi _h \, v(\cdot )\Vert _{L^2(I)} \le C' h^2 \vert v \vert _{H^2(I)} \quad \forall v \in H^2(I). \end{aligned}$$
(13)

For \(N_0 \le N\) large enough, using the asymptotic expansion of the \(\ln (1+ z)\) function at zero, we get from 11:

$$\begin{aligned} Y_{X}= \sum _{i=1}^p T(X_i)\exp {\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}\left( \dfrac{1}{N}\right) }\varPhi _i(X). \end{aligned}$$

Thus:

$$\begin{aligned} \varPi _h T(X) - Y_{X}= \sum _{i=1}^p \left[ 1-\exp {\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}\left( \dfrac{1}{N}\right) } \right] T(X_i)\varPhi _i(X), \end{aligned}$$

and

$$\begin{aligned} \Vert \varPi _h T(\cdot ) - Y_{\cdot }\Vert _{L^2(I)} \le \dfrac{C_1}{N} \Vert \varPi _h T(\cdot ) \Vert _{L^2(I)} \le \dfrac{C_2}{N} \Vert T(\cdot ) \Vert _{H^2(I)}. \end{aligned}$$

The triangle inequality gives the result. \(\square\)

Remark 1

The proposed LNN is a realization of the time-dependent transport of the Id operator towards the operator T defined by: \(t \mapsto T^t\) for \(0\le t \le 1\), or equivalently: \(Y'(t)=\ln {(T)} Y(t), \, 0< t \le 1; Y(0) =Id.\)

The \(1-d\) Case with Non-equidistant Sampling Points and the Two-Neighbor Signed Distance Strategy

Let \(I \subset {\mathbb {R}}_+^*\) be a compact subset, \(\{ X_i\}_{i=1}^p\) the sampling points in I, and \(h= \sup _{2\le i \le p} \vert X_{i} - X_{i-1}\vert\). For \(2\le i \le p-1\), denote by:

$$\begin{aligned} {{\tilde{\varPhi }}}_i (x) = \left\{ \begin{array}{ll} \dfrac{x-X_{i-1}}{X_{i}-X_{i-1}} &{}\text { if } x \le X_{i} \\ \dfrac{X_{i+1}-x}{X_{i+1}-X_{i}} &{}\text { if } X_{i} \le x \\ \end{array} \right. \end{aligned}$$

the extended global Lagrange basis functions of degree one associated with the grid \(\{ X_i\}_{i=1}^p\). Define the signed distance \(d_S\) to \(\{ X_i\}_{i=1}^p\) by:

$$\begin{aligned} d_S(x) = \underset{1 \le i \le p}{Min }\vert x - {{\tilde{\varPhi }}}_i(x) \vert \, \forall x \in I. \end{aligned}$$

The two nearest neighbors, for the signed distance, of \(X \in I\) are the two end points of the interval \([X_{i-1}, X_{i}]\) to which X belongs.

Definition 4

(input contribution for irregular grid) For all \(X \in I\), the components of the p-vector \(\left( \varPhi _1(X), \ldots , \varPhi _p(X) \right) ^t\) are the barycentric coordinates of X with respect to its two nearest neighbors for the signed distance in \(\{ X_i\}_{i=1}^p\), completed by zeros.

For all \(X \in I\), the components of the p-vector \(\left( \varPhi _1(X), \ldots , \varPhi _p(X) \right) ^t\) are thus the values at X of the \({\mathbb {P}}_1\) Lagrange basis functions associated with the grid.

Using the same LNN and arguing in the same way as in the case of equidistant sampling points, we get the same convergence result as in Theorem 1.

The \(1-d\) Case with Non-equidistant Sampling Points, Localized Generalized Finite Element, and Two-Nearest-Neighbor Strategy

Here, we follow the method of generalized finite elements introduced by Babuska et al. in [1]. Let \(I \subset {\mathbb {R}}_+^*\) be a compact subset, \(\{ X_i\}_{i=1}^p\) the sampling points in I, and \(h= \sup _{2\le i \le p} \vert X_{i} - X_{i-1}\vert\). Let us denote by \(\varPhi _i\) the global basis functions of the \({\mathbb {P}}_1\)-Lagrange interpolant associated with the sampling points \(\{ X_i\}_{i=1}^p\). Set \(\omega _i={Supp} (\varPhi _i)\), the support of \(\varPhi _i\), for \(1 \le i \le p\); these are patches satisfying the following properties:

  1. \(I =\cup _{i=1}^p \omega _i\); \(\varPhi _i=0\) on \(I\setminus \omega _i\);

  2. \(\sum _{i=1}^p \varPhi _i(x)=1\) \(\forall x\in I\);

  3. there exists \(0< C_1\) such that \(\underset{x\in I}{\max }\vert \varPhi _i(x) \vert \le C_1\) for \(1 \le i \le p.\)

This partition of unity on I has piecewise continuous derivatives. The generalized finite element is defined by the following:

Definition 5

(Generalized finite element) The geometric support of the ith finite element is \(Supp(\varPhi _i)\), and the local polynomial space is \({\mathbb {P}}_1\), the basis of which is \(\{\varphi _1^i, \varphi _2^i\}\), the barycentric coordinates with respect to the two nearest neighbors in \(\{ X_k\}_{k=i-1}^{i+1} \subset Supp(\varPhi _i)\), for \(1\le i \le p\); see Fig. 2.

Fig. 2: \(i^{th}\) patch function

The global approximation space \(V_h\) is given by:

$$\begin{aligned} V_h = span \left\{ \varPhi _i(\cdot ) \sum _{k=1}^2 \varphi _k^i(\cdot ) \right\} _{i=1}^p. \end{aligned}$$

We have:

Lemma 3

For all \(\omega _i\), denote by \(\varPi _{i} \in {{{\mathcal {L}}}} (H^2(\omega _i); L^2(\omega _i))\) the \({\mathbb {P}}_1\) interpolation operator defined, \(\forall x \in \omega _i\), by \(\varPi _{i} v(x) = \sum _{k=1}^2 \varphi _k^i(x) v(X_{i_k})\), with \(X_{i_1}, X_{i_2}\) the two nearest neighbors in \(\{ X_k\}_{k=i-1}^{i+1}\). Assume that there exists \(0<M\) such that \(\frac{h}{\underset{1 \le i \le p-1}{Inf }\vert X_{i+1} -X_i \vert } \le M\). Then, there exists \(0< C\) such that for all \(v \in H^2(\omega _i)\), we have the following estimate:

$$\begin{aligned} \Vert v- \varPi _{i} v \Vert _{L^2(\omega _i)} \le C h^2 \vert v \vert _{H^2(\omega _i)}, \end{aligned}$$

where \(\vert \cdot \vert _{H^2(\omega _i)}\)is the \(H^2(\omega _i)\) semi-norm.

Proof

Using the technique of returning to a reference interval \(K_i= [ \frac{X_{i-1}-X_i}{X_{i+1}-X_i} , 1]\) with an affine transformation \({{\hat{F}}}({{\hat{x}}})= X_i + {{\hat{x}}} (X_{i+1}-X_i)\), we set \({\hat{v}}=v\circ {\hat{F}}\) and define an interpolation operator \(\hat{\varPi }_i\), such that \({{\hat{\varPi }}}_i{\hat{v}}=\varPi _iv\circ {\hat{F}}\). Using the change of variable \({\hat{F}}\), we have:

$$\begin{aligned} \Vert v- \varPi _i v \Vert _{L^2(\omega _i)}^2 = \vert X_{i+1}-X_i\vert \Vert {\hat{v}}- {{\hat{\varPi }}}_i {\hat{v}} \Vert _{L^2(K_i)}^2 . \end{aligned}$$

Since \({\hat{v}}- {{\hat{\varPi }}}_i {\hat{v}}\) vanishes at \({\hat{x}} \in \{0, 1\}\), using Rolle’s theorem and integrating twice, we get:

$$\begin{aligned} \Vert {\hat{v}}- {{\hat{\varPi }}}_i {\hat{v}} \Vert _{L^2(K_i)}^2 \le C(\vert K_i \vert ) \vert {\hat{v}}- {{\hat{\varPi }}}_i {\hat{v}} \vert _{H^2(K_i)}^2 = C(\vert K_i \vert )\vert {\hat{v}} \vert _{H^2(K_i)}^2, \end{aligned}$$

where \(\vert K_i \vert\) denotes the length of \(K_i\) which is controlled uniformly with respect to the index i using the hypothesis regarding the mesh. Now, we use the change of variable \({\hat{F}}^{-1}\) to get a bound with the Sobolev’s \(H^2\) semi-norm \(\vert \cdot \vert _{H^2(\omega _i)}\) which provides a \(\frac{h^4}{\vert X_{i+1}-X_i\vert }\) factor in the previous inequality, and thus, we get the announced result. \(\square\)

For the global interpolation operator, we have:

Lemma 4

Let \(\varPi _h : H^2(I) \longrightarrow V_h \subset L^2(I)\) be defined by:

$$\begin{aligned} \varPi _h v (\cdot )=\sum _{i=1}^p \varPhi _i(\cdot ) \varPi _{i} v(\cdot ), \end{aligned}$$

where \(\varPi _i\) is defined in Lemma 3. Assume that there exists \(0<M\) such that \(\frac{h}{\underset{1 \le i \le p-1}{Inf }\vert X_{i+1} -X_i \vert } \le M\). Then, there exists \(0< C_2\) such that for all \(v \in H^2(I)\), we have the following estimate:

$$\begin{aligned} \Vert v- \varPi _h v \Vert _{L^2(I)} \le C_2 h^2 \vert v \vert _{H^2(I)}, \end{aligned}$$

Proof

$$\begin{aligned} \Vert v- \varPi _h v \Vert _{L^2(I)} ^2=\,\, & {} \Vert v- \sum _{i=1}^p \varPhi _i \varPi _{i} v\Vert _{L^2(I)} ^2\\= & {} \Vert \sum _{i=1}^p \varPhi _i (v- \varPi _{i} v)\Vert _{L^2(I)} ^2. \end{aligned}$$

Using the fact that any \(x\in I\) belongs to at most two subdomains \(\omega _i\), we see that the sum \(\sum _{i=1}^p \varPhi _i (v- \varPi _{i} v)\) has at most two non-zero terms at any \(x \in I\). Hence, the Cauchy–Schwarz inequality provides:

$$\begin{aligned} \left| \sum _{i=1}^p \varPhi _i (v- \varPi _{i} v)\right| ^2 \le 2 \sum _{i=1}^p \vert \varPhi _i (v- \varPi _{i} v)\vert ^2. \end{aligned}$$

Then, we have:

$$\begin{aligned} \Vert v- \varPi _h v \Vert _{L^2(I)} ^2\le & {} 2\int _I \sum _{i=1}^p \vert \varPhi _i (v- \varPi _{i} v)\vert ^2 \, dx \\\le \,\,& {} 2C_1^2 \sum _{i=1}^p \int _{\omega _i} \vert (v- \varPi _{i} v)\vert ^2 \, dx, \end{aligned}$$

which with Lemma 3 gives:

$$\begin{aligned} \Vert v- \varPi _h v \Vert _{L^2(I)} ^2\le & {} 2C_1^2 \sum _{i=1}^p C^2 h^4 \vert v \vert _{H^2(\omega _i)}^2 \\= \,\,& {} 2C_1^2 C^2 h^4 \vert v \vert _{H^2(I)}^2. \end{aligned}$$

\(\square\)
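The global interpolant of Lemma 4 can be sketched as follows on a sorted, possibly non-uniform grid (illustrative; the function names are ours). For each patch \(\omega _i\) containing x, the local interpolant \(\varPi _i\) uses the two points of \(\{X_{i-1}, X_i, X_{i+1}\}\) closest to x, and the local contributions are blended by the hat functions \(\varPhi _i\).

```python
import numpy as np

def hat(i, x, grid):
    """Global P1 hat function Phi_i on the sorted grid."""
    p = len(grid)
    left = grid[i - 1] if i > 0 else grid[0]
    right = grid[i + 1] if i < p - 1 else grid[p - 1]
    if i > 0 and left <= x <= grid[i]:
        return (x - left) / (grid[i] - left)
    if i < p - 1 and grid[i] <= x <= right:
        return (right - x) / (right - grid[i])
    return 0.0

def local_p1(x, pts, vals):
    """P1 interpolant through the two points of `pts` nearest to x, evaluated at x."""
    pts, vals = np.asarray(pts, dtype=float), np.asarray(vals, dtype=float)
    order = np.argsort(np.abs(pts - x))[:2]
    (a, b), (va, vb) = pts[order], vals[order]
    return va + (vb - va) * (x - a) / (b - a)

def generalized_interpolant(x, grid, values):
    """Pi_h v(x) = sum_i Phi_i(x) Pi_i v(x)  (Lemma 4)."""
    grid, values = np.asarray(grid, dtype=float), np.asarray(values, dtype=float)
    p = len(grid)
    total = 0.0
    for i in range(p):
        w = hat(i, x, grid)
        if w == 0.0:
            continue
        lo, hi = max(i - 1, 0), min(i + 2, p)
        total += w * local_p1(x, grid[lo:hi], values[lo:hi])
    return total
```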

To account for the irregularity of the grid \(\{ X_i\}_{i=1}^p\), we have to modify the LNN as follows:

Definition 6

(input contribution, irregular case) For all \(X \in I\), the components of the p-vector \(\left( \psi _1(X), \cdots , \psi _p(X) \right) ^t\) are defined by:

$$\begin{aligned} \psi _i(X)=\varPhi _i(X) \sum _{k=1}^2 \varphi _k^i(X), \end{aligned}$$

where the \(\varphi ^i_k(X)\) are the barycentric coordinates of X with respect to the two nearest neighbors in \(\{ X_k\}_{k=i-1}^{i+1} \subset Supp(\varPhi _i)\), for \(1\le i \le p\).

Unfortunately, the functions \(\psi _i\) are not always positive, and thus, we modify our algorithm.

Definition 7

(LNN channel) Let \(\psi _i(X)\) be given, and for \(1 \le i \le p\), the LNN i-channel is defined by:

$$\begin{aligned} \left\{ \begin{array}{ll} Y_{i}^{m+1} &{} = (1 + \varDelta t \, a_i) Y_{i}^{m}\quad 0\le m\le N-1 \\ Y_{i}^{0}&{}= \psi _i(X). \end{array} \right. \end{aligned}$$
(14)

We have from 14:

$$\begin{aligned} Y_{i}^{N}= \exp {\left( N \ln {\left[ 1 + \varDelta t \, a_i \right] }\right) } \psi _i(X), \end{aligned}$$
(15)

and gathering all the channels' contributions provides:

$$\begin{aligned} Y_{X}= \sum _{i=1}^p\exp {\left( N \ln {\left[ 1 + \varDelta t \, a_i \right] }\right) } \psi _i(X). \end{aligned}$$
(16)

Since \(\psi _l(X_q)= \left\{ \begin{array}{lr} 1 &{} \text { if } l=q \\ 0 &{} \text { otherwise } \end{array} \right.\) for \(1 \le l,q \le p\), then Lemma 2 applies and we get \(a_i = \ln T(X_i) + \mathop {}\mathopen {}{\mathcal {O}}\mathopen {}\left( \dfrac{1}{N}\right)\) for \(1\le i\le p\), and finally:

$$\begin{aligned} Y_{X}= \sum _{i=1}^pT(X_i)\exp {\left( \mathop {}\mathopen {}{\mathcal {O}}\mathopen {}\left( \dfrac{1}{N}\right) \right) } \psi _i(X). \end{aligned}$$
(17)

Theorem 2

Assume that \(I\subset {\mathbb {R}}_+^*\) is compact, \(T\in H^2(I; {\mathbb {R}}_+^*)\), \(\{ X_i\}_{i=1}^p\) are irregularly distributed sampling points in I with \(h=\sup _{2 \le i \le p} \vert X_{i} - X_{i-1}\vert\), and there exists \(0<M\) such that \(\frac{h}{\underset{1 \le i \le p-1}{Inf }\vert X_{i+1} -X_i \vert } \le M\). Let \(N_0= \sup _{1\le i \le p}\vert \ln {T(X_i)}\vert\) and \(Y_{(\cdot )}\) be the output of the LNN (see 17). Then, the following asymptotic error estimate holds true for \(N_0 \le N\):

$$\begin{aligned} \Vert T (\cdot ) -Y_{(\cdot )}\Vert _{L^2(I)} \le C_3\left( h^2 + \frac{1}{N} \right) . \end{aligned}$$
(18)

Proof

Let \(\varPi _h\) denote the interpolation operator defined in Lemma 4 associated with the grid \(\{ X_i\}_{i=1}^p\) for continuous functions defined on I:

$$\begin{aligned} \varPi _h v(z) = \sum _{i=1}^p v(X_i) \psi _i(z) \quad \forall z \in I. \end{aligned}$$

From Lemma 4, we have:

$$\begin{aligned} \Vert v (\cdot ) -\varPi _h \, v(\cdot )\Vert _{L^2(I)} \le C_2 h^2 \vert v \vert _{H^2(I)} \quad \forall v \in H^2(I). \end{aligned}$$
(19)

For \(N_0 \le N\) large enough, we have from 17:

$$\begin{aligned} \varPi _h T(X) - Y_{X}= \sum _{i=1}^p \left[ 1-\exp {\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}\left( \dfrac{1}{N}\right) } \right] T(X_i)\psi _i(X). \end{aligned}$$

The estimate for \(\varPi _h T - Y_{(\cdot )}\) then reads:

$$\begin{aligned} \Vert \varPi _h T(\cdot ) - Y_{\cdot }\Vert _{L^2(I)} \le \dfrac{C_4}{N} \Vert \varPi _h T(\cdot ) \Vert _{L^2(I)} \le \dfrac{C_5}{N} \Vert T(\cdot ) \Vert _{H^2(I)}. \end{aligned}$$

The triangle inequality yields:

$$\begin{aligned} \Vert T(\cdot ) - Y_{\cdot }\Vert _{L^2(I)} \le \Vert T(\cdot ) - \varPi _h T_{\cdot }\Vert _{L^2(I)} + \Vert \varPi _h T(\cdot ) - Y_{\cdot }\Vert _{L^2(I)}, \end{aligned}$$

which gives the result.

\(\square\)

The Multi-dimensional Case

Let \(\varOmega \subset ({\mathbb {R}}_+^*)^d\) be a bounded Lipschitz open subset, located at a distance \(0<h_0\) from the boundary of \(({\mathbb {R}}_+^*)^d\) and containing the convex hull of the sampling points \(\{ X_i\}_{i=1}^p\). The domain of the operator T is \(({\mathbb {R}}_+^*)^d\), and T is assumed to be \(C^2\) with bounded second derivatives.

In this section, it is assumed that the following hypothesis holds true:

  • H5: there exists \(0<h \le h_0\) such that \(\forall X \in \varOmega\), there exist \(d+1\) nearest neighbors of X among \(\{ X_i\}_{i=1}^p\) at a distance at most h (i.e., belonging to the ball B(X, h)) which constitute a non-degenerate d-simplex \(K_X\). Moreover, there exists \(0< {{\underline{\rho }}}\), the radius of the largest ball included in \(K_X\), \(\forall X \in \varOmega\), such that \(\frac{h}{ {{\underline{\rho }}}}\) is bounded; a selection sketch is given right after this hypothesis.
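Under H5, the simplex \(K_X\) can be selected, for instance, as follows (a sketch with our own naming; H5 guarantees that the selected neighbors lie in B(X, h) and are not degenerate, and a real implementation would rely on a spatial search structure such as a k-d tree):

```python
import numpy as np

def nearest_simplex(X, samples):
    """Indices of the d+1 sampling points nearest to X; under H5 they lie in
    B(X, h) and form a non-degenerate d-simplex K_X."""
    samples = np.asarray(samples, dtype=float)          # shape (p, d)
    d = samples.shape[1]
    dist = np.linalg.norm(samples - np.asarray(X, dtype=float), axis=1)
    idx = np.argsort(dist)[: d + 1]
    # non-degeneracy check: the edge vectors A_j - A_0 must span R^d
    edges = samples[idx[1:]] - samples[idx[0]]
    if abs(np.linalg.det(edges)) < 1e-12:
        raise ValueError("the d+1 nearest neighbours are (numerically) degenerate")
    return idx
```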

For the multi-dimensional case, we argue locally. Let \(X \in \varOmega\) be fixed; we introduce a local interpolation space.

Definition 8

(local interpolation space) Define \({\mathbb {P}}_1 (B(X, h)) =Span \{\varphi _j \}_{j=1}^{d+1}\), where \(\varphi _j\) denotes the \({\mathbb {P}}_1\)-Lagrange function associated with the j-th vertex \(A_j \in \{ X_i\}_{i=1}^p\) of the d-simplex \(K_X\). We have, \(\forall f \in C^0({\overline{B}}(X, h))\), \(\varPi _h f(\cdot ) = \sum _{j=1}^{d+1}\varphi _j(\cdot ) f(A_j)\).

Now, we give the error estimate obtained with a multi-point Taylor formula; see [4], p. 128, or [14], Section 6.

Lemma 5

Assume \(u \in C^2({{\overline{\varOmega }}})\). Then, there exists \(0< C\) such that for all \(X \in \varOmega\) and \(\forall x \in B(X, h)\), we have the following estimate:

$$\begin{aligned} \vert (\varPi _h u - u )(x) \vert \le C \frac{(d+1)}{2} \underset{y\in {{\overline{\varOmega }}}}{Sup} \Vert D^2 u(y) \Vert _{\mathcal{L}^2({\mathbb {R}}^d;{\mathbb {R}})}h^2. \end{aligned}$$
(20)

The hypothesis that \(\frac{h}{ {{\underline{\rho }}}}\) remains bounded is used to obtain a uniform bound for the \({\mathbb {P}}_1\)-Lagrange functions in B(X, h). Let us give the input contribution to the LNN for the d-dimensional case.

Definition 9

(input contribution in d dimensions) For all \(X \in \varOmega\), the components of the p-vector \(\left( \varPhi _1(X), \cdots , \varPhi _p(X) \right) ^t\) are the barycentric coordinates of X relative to \(K_X\), the d-simplex of nearest neighbors of X in \(\{ X_i\}_{i=1}^p\), completed by zeros.
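The barycentric coordinates of Definition 9 solve a small linear system; a sketch (`barycentric_coordinates` is our name):

```python
import numpy as np

def barycentric_coordinates(X, simplex_vertices):
    """Barycentric coordinates of X relative to the d-simplex whose rows are
    the d+1 vertices A_1, ..., A_{d+1}: solve sum_j lambda_j A_j = X with
    sum_j lambda_j = 1 (Definition 9)."""
    A = np.asarray(simplex_vertices, dtype=float)   # shape (d+1, d)
    d = A.shape[1]
    M = np.vstack([A.T, np.ones(d + 1)])            # (d+1) x (d+1) system matrix
    rhs = np.append(np.asarray(X, dtype=float), 1.0)
    return np.linalg.solve(M, rhs)

# example: the centroid of the unit 2-simplex has coordinates (1/3, 1/3, 1/3)
print(barycentric_coordinates([1/3, 1/3], [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```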

The functions \(\varPhi _i(X)\) are not all non-negative, since we do not necessarily have \(X \in K_X\), and thus, the LNN is given by the following:

Definition 10

Let \(\varPhi _i(X)\) be given, for \(1 \le i \le p\), the LNN i-channel is defined by:

$$\begin{aligned} \left\{ \begin{array}{ll} Y_{i}^{m+1} &{} = (1 + \varDelta t \, a_i) Y_{i}^{m}\quad 0\le m\le N-1 \\ Y_{i}^{0}&{}= \varPhi _i(X). \end{array} \right. \end{aligned}$$
(21)

We have from 21:

$$\begin{aligned} Y_{i}^{N}= \exp {\left( N \ln {\left[ 1 + \varDelta t \, a_i \right] }\right) } \varPhi _i(X), \end{aligned}$$
(22)

and gathering all channels contribution provides:

$$\begin{aligned} Y_{X}= \sum _{i=1}^p\exp {\left( N \ln {\left[ 1 + \varDelta t \, a_i \right] }\right) } \varPhi _i(X). \end{aligned}$$
(23)

Since \(\varPhi _l(X_q)= \left\{ \begin{array}{lr} 1 &{} \text { if } l=q \\ 0 &{} \text { otherwise } \end{array} \right.\) for \(1 \le l,q \le p\) such that \(X_l, X_q\) are vertices of \(K_X\), Lemma 2 applies, and we get \(a_i = \ln T(X_i) + \mathop {}\mathopen {}{\mathcal {O}}\mathopen {}\left( \dfrac{1}{N}\right)\) for \(1\le i\le p\) such that \(X_i\) is a vertex of \(K_X\), and finally:

$$\begin{aligned} Y_{X}= \sum _{i=1}^{d+1}T(X_i)\exp {\left( \mathop {}\mathopen {}{\mathcal {O}}\mathopen {}\left( \dfrac{1}{N}\right) \right) } \varPhi _i(X). \end{aligned}$$
(24)
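Combining the two previous sketches, the local output (24) can be evaluated as follows (illustrative, with \(a_i = \ln T(A_j)\) as in Lemma 2 and assuming \(N \ge N_0\) so that \(1 + a_i/N > 0\)):

```python
import numpy as np

def lnn_output_dd(X, samples, T_samples, N):
    """Local output (24): P1 interpolation on K_X, each vertex value carrying
    the exp(O(1/N)) growth factor; reuses nearest_simplex and
    barycentric_coordinates from the sketches above."""
    samples = np.asarray(samples, dtype=float)
    idx = nearest_simplex(X, samples)                    # vertices A_j of K_X
    lam = barycentric_coordinates(X, samples[idx])       # Phi_i(X) on K_X
    a = np.log(np.asarray(T_samples, dtype=float)[idx])  # a_i = ln T(A_j)
    growth = np.exp(N * np.log(1.0 + a / N))             # -> T(A_j) as N -> infinity
    return float(np.dot(growth, lam))
```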

Thanks to Lemma 5 and the triangle inequality, we have the following error estimate:

Theorem 3

The operator \(T: ({\mathbb {R}}_+^*)^d \longrightarrow {\mathbb {R}}\) is assumed to be \(C^2\) with bounded second derivatives. Let \(N_0= \sup _{1\le i \le p}\vert \ln {T(X_i)}\vert\) and \(Y_{(\cdot )}\) be the output of the LNN (see 24). Then, there exists \(0<C_d\) such that the following asymptotic error estimate holds true for \(N_0 \le N\):

$$\begin{aligned} \vert T (X) -Y_{X}\vert \le C_d\left( h^2 + \frac{1}{N} \right) \quad \forall X \in \varOmega \end{aligned}$$
(25)

Proof

For \(N_0 \le N\) large enough, from 23, we have:

$$\begin{aligned} \vert \varPi _h T(X) - Y_{X}\vert= & {} \vert \sum _{i=1}^p \left[ 1-\exp {\mathop {}\mathopen {}{\mathcal {O}}\mathopen {}\left( \dfrac{1}{N}\right) } \right] T(X_i)\varPhi _i(X)\vert \\\le & {} \frac{C(K_X)}{N}\Vert T \Vert _{L^\infty ( \varOmega )}. \end{aligned}$$

The estimate for \((\varPi _h-I)T\) reads:

$$\begin{aligned} \vert (\varPi _h T - T )(x) \vert \le C \frac{(d+1)}{2} \underset{y\in \varOmega }{Sup} \Vert D^2 T(y) \Vert _{{{{\mathcal {L}}}}^2({\mathbb {R}}^d;{\mathbb {R}})}h^2. \end{aligned}$$
(26)

The triangle inequality yields:

$$\begin{aligned} \vert T(X) - Y_{X}\vert \le \vert T(X) - \varPi _h T_{X}\vert + \vert \varPi _h T(X) - Y_{X}\vert . \end{aligned}$$

\(\square\)

Remark 2

Introduce the covering metric radius function:

$$\begin{aligned} R(r, \varOmega )= & {} \inf \{ 0< \epsilon , \, \exists x_1, \cdots x_r \in {\mathbb {R}}^d, \text {such that } \varOmega \\&\subset \cup _{i=1}^r B(x_i, \epsilon ) \} \end{aligned}$$

and then, a global interpolation operator can be built as in the one-dimensional case using a partition of unity, provided that we assume the following:

  • let \(h=R(r, \varOmega )\); \(\forall x \in \varOmega\), there exist at most \(\chi\) indexes \(i\in \{1, \cdots , r \}\) such that the balls \(B(x_i, h) \ni x\);

  • concerning the sampling points \(\{ X_i\}_{i=1}^p\): \(\forall i \in \{1, \cdots , r \}\), the \(d+1\) nearest neighbors of \(x_i\) among \(\{ X_i\}_{i=1}^p\) located at a distance at most h from \(x_i\) constitute a non-degenerate d-simplex \(K_i\). Moreover, there exists \(0< {{\underline{\rho }}}\), the radius of the largest ball included in \(K_i\), \(i\in \{1, \cdots , r \}\), such that \(\frac{h}{ {{\underline{\rho }}}}\) is bounded.

For all \(1 \le i \le r\), define \({\mathbb {P}}_1 (B(x_i, h)) =Span \{\varphi _j^i \}_{j=1}^{d+1}\), where \(\varphi _j^i\) denotes the \({\mathbb {P}}_1\)-Lagrange function associated with the j-th vertex \(A_j \in \{ X_i\}_{i=1}^p\) of the d-simplex \(K_i \subset B(x_i, h)\). We have, \(\forall f \in C^0({\overline{B}}(x_i, h))\), \(\varPi _h^i f(\cdot ) = \sum _{j=1}^{d+1}\varphi _j^i(\cdot ) f(A_j)\). The local interpolation space is \(V_h=\varPi _{i=1}^r {\mathbb {P}}_1 (B(x_i, h))\).

Remark 3

The given error estimates are not optimal; using the averaged Taylor polynomial, it is possible to get estimates in Sobolev norms.

Some Numerical Results

In this section, some numerical simulations are given to substantiate the theoretical results. For the 1-d case, let \(T: [1,3] \rightarrow {\mathbb {R}}\) be defined by \(T(x) =3 x^2\). Table 1 gives the \(L^1\)-error computed with Simpson numerical integration for different numbers of sampling points and \(N=10^4\) layers of neurons.

Table 1 Error estimates for 1-d equidistant sampling points

The order of convergence is \(h^2\), as asserted by the theoretical results.
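The 1-d experiment can be reproduced along the following lines (a sketch, not the authors' code; it reuses `lnn_output` from the earlier snippet and implements a composite Simpson rule for the \(L^1\) error):

```python
import numpy as np

def simpson_l1_error(p, N=10_000, n_quad=2001):
    """L^1([1, 3]) error between T(x) = 3 x^2 and the LNN built on p equidistant
    sampling points, integrated with a composite Simpson rule on n_quad nodes."""
    grid = np.linspace(1.0, 3.0, p)
    T_values = 3.0 * grid ** 2
    xs = np.linspace(1.0, 3.0, n_quad)                   # odd number of nodes
    err = np.abs(np.array([3.0 * x ** 2 - lnn_output(x, grid, T_values, N)
                           for x in xs]))
    w = xs[1] - xs[0]
    return w / 3.0 * (err[0] + err[-1]
                      + 4.0 * err[1:-1:2].sum() + 2.0 * err[2:-1:2].sum())

for p in (11, 21, 41, 81):
    print(p, simpson_l1_error(p))   # the error should decrease roughly like h^2
```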

For the \(d=800\) case, let T be defined by \(T(x) =\sum _{k=1}^{800} x_k^{ (2 + (-1)^k)}\). We consider four cases, each constituted of 4000 learning examples randomly distributed on a hypercube centered at \(X=(1.1, \cdots , 1.1)^t\), the step of which is, respectively, 0.05, 0.1, 0.2, and 0.4. The error is given at the point X and at \(X_{K_X}\), the barycenter of the simplex \(K_X\), for \(N=10^5\) layers of neurons; see Tables 2 and 3.

Table 2 Error estimates for \(d=800\) and 4000 sampling points randomly distributed
Table 3 Error estimates for \(d=800\) and 3000 sampling points randomly distributed

Let us remark that for all computations in dimension \(d=800\), the point \(X\notin K_X\). At the barycenter of the simplex \(K_X\), it can be checked that the order of convergence is \(h^2\) for the two numbers of sampling points. At the point X, the order of convergence deteriorates as h becomes too small. This comes from the fact that we have very little control on the similarity between \(K_X\) and the ball \(B_X\). It can be checked that if the maximum of the barycentric coordinates is too large compared with the barycentric coordinates of the barycenter \(X_{K_X}\), then the approximation deteriorates.

Conclusion

By relating a particular LNN to a time-dependent transport problem, we provide an error estimate for this LNN using the \({\mathbb {P}}_1\) Lagrange interpolant. Space and time are shown to play different roles in the error estimate. Thanks to the generalized finite-element method originally presented by I. Babuska, a nearest-neighbor strategy is proved to be efficient for handling the space accuracy in one dimension as well as in several dimensions. Moreover, the nearest-neighbor strategy is local and allows one to reduce the computational expense of the proposed method.

In d dimensions, the error estimate presented in Theorem 3 is not optimal; using the averaged Taylor polynomial, it is possible to get better estimates in Sobolev norms.

References

  1. Babuska I, Banerjee U, Osborn J. Generalized finite element methods: main ideas, results and perspective. Int J Comput Methods. 2004;1(1):67–103.

  2. Cao F, Xie T, Xu Z. The estimate for approximation error of neural networks: a constructive approach. Neurocomputing. 2008;71:626–30.

  3. Chang B, Meng L, Haber E, Tung F, Begert D. Multi-level residual networks from dynamical systems view. arXiv:1710.10348, 2017.

  4. Ciarlet PG. The finite element method for elliptic problems. Studies in Mathematics and its Applications. North-Holland; 1976.

  5. Chen TQ, Rubanova Y, Bettencourt J, Duvenaud DK. Neural ordinary differential equations. Advances in Neural Information Processing Systems. 2018;6571–83.

  6. Cuchiero C, Larsson M, Teichmann J. Deep neural networks, generic universal interpolation, and controlled ODEs. arXiv:1908.07838 [math.OC], 2019.

  7. Filici C. Error estimation in the neural network solution of ordinary differential equations. Neural Netw. 2010;23:614–7.

  8. Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016. Available from http://www.deeplearningbook.org.

  9. Guliyev NJ, Ismailov VE. On the approximation by single hidden layer feedforward neural networks with fixed weights. Neural Netw. 2018;98:296–304.

  10. He J, Li L, Xu J, Zheng C. ReLU deep neural networks and linear finite elements. arXiv:1807.03973, 2018.

  11. Hornik K. Approximation capabilities of multilayer feedforward networks. Neural Netw. 1991;4(2):251–7.

  12. Larger L, Baylón-Fuentes A, Martinenghi R, Udaltsov VS, Chembo YK, Jacquot M. High-speed photonic reservoir computing using a time-delay-based architecture: million words per second classification. Phys Rev X. 2017;7:011015.

  13. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44.

  14. Waldron S. Multipoint Taylor formulæ. Numer Math. 1998;80:461–94.

Author information

Correspondence to Jerome Pousin.

Ethics declarations

Conflict of Interest:

The authors declare that they have no conflict of interest.



Cite this article

Benmansour, K., Pousin, J. Nearest Neighbors Strategy, \({\mathbb {P}}_1\) Lagrange Interpolation, and Error Estimates for a Learning Neural Network. SN COMPUT. SCI. 2, 38 (2021). https://doi.org/10.1007/s42979-020-00409-3


Keywords

  • Learning neural network
  • Transport operator
  • \({\mathbb {P}}_1\) Lagrange interpolation
  • Error estimates
  • Nearest-neighbor strategy

Mathematics Subject Classification

  • 65L09
  • 65D05
  • 65G99