1 Introduction

Deep neural networks have achieved great success in many applications such as image processing (Krizhevsky et al. 2012), speech recognition (Hinton et al. 2012) and the game of Go (Silver et al. 2016). However, the reason why deep networks work so well in these fields has remained a mystery for a long time. Different lines of research try to understand the mechanism of deep neural networks from different aspects. For example, a series of work tries to understand how the expressive power of deep neural networks is related to their architecture, including the width of each layer and the depth of the network (Telgarsky 2015, 2016; Lu et al. 2017; Liang and Srikant 2016; Yarotsky 2017, 2018; Hanin 2017; Hanin and Sellke 2017). These works show that multi-layer networks with sufficiently wide layers can approximate arbitrary continuous functions.

Very recently, a large body of work has emerged that studies the global convergence of gradient descent (GD) for training neural networks (Li and Liang 2018; Du et al. 2018b; Allen-Zhu et al. 2018a; Du et al. 2018a). In particular, Li and Liang (2018) showed that for a one-hidden-layer network with ReLU activation function, using over-parameterization and random initialization, GD and stochastic gradient descent (SGD) can find a globally near-optimal solution in polynomial time. Du et al. (2018b) showed that under the assumption that the ReLU Gram matrix is positive definite, randomly initialized GD converges to a globally optimal solution of a one-hidden-layer network with ReLU activation function and quadratic loss function. Beyond shallow neural networks, Du et al. (2018a) considered the regression problem with square loss, and proved that under certain assumptions on the initialization and training data, gradient descent is able to converge to a globally optimal solution for training deep neural networks. However, Du et al. (2018a) only investigated DNNs with smooth activation functions, which excludes the widely used ReLU activation function. Moreover, the theoretical results in Du et al. (2018a) heavily rely on the assumption that the smallest eigenvalue of a certain deep compositional Gram matrix is bounded away from zero, which does not make explicit the dependency on problem parameters such as the number of training examples n and the number of hidden layers L, and this assumption cannot be verified in practice. Allen-Zhu et al. (2018a) studied the same problem under a different assumption on the training data, and proved that random initialization followed by gradient descent is able to converge to a globally optimal solution for training deep neural networks. Besides, Allen-Zhu et al. (2018a) studied the convergence rate of SGD for training deep ReLU networks and discussed various extensions to the classification problem and to other loss functions. However, the assumption on the training data made in Allen-Zhu et al. (2018a) is very stringent: they require that any two training data points are separated by some constant, whereas in practice data from the same class can be arbitrarily close (e.g., due to data augmentation in deep learning). Our work is independent of and concurrent with Du et al. (2018a) and Allen-Zhu et al. (2018a).

In this paper, we study the optimization properties of gradient-based methods for deep ReLU neural networks, with a more realistic assumption on the training data, a milder over-parameterization condition and a faster convergence rate. Specifically, we consider an L-hidden-layer fully-connected neural network with ReLU activation function. Similar to the one-hidden-layer case studied in Li and Liang (2018) and Du et al. (2018b), we study the binary classification problem and show that GD can achieve the global minimum of the training loss for any \(L \ge 1\), with the aid of over-parameterization and random initialization. The high-level idea of our proof technique is to show that Gaussian random initialization followed by gradient descent generates a sequence of iterates within a small perturbation region centered around the initial weights. In addition, we show that the empirical loss function of deep ReLU networks has very good local curvature properties inside the perturbation region, which guarantees the global convergence of gradient descent. Compared with the proof technique in Allen-Zhu et al. (2018a), we provide a sharper analysis of the GD algorithm and prove that GD can be guaranteed to make sufficient descent in a larger perturbation region with a larger step size. This leads to a faster convergence rate and a milder condition on the over-parameterization. More specifically, our main contributions are summarized as follows:

  • We establish a global convergence guarantee for training deep ReLU networks for classification problems. Compared with Li and Liang (2018) and Allen-Zhu et al. (2018a), our assumption on the training data is more reasonable and is often satisfied by real training data. Specifically, we only require that any two data points from different classes are separated by some constant, while Li and Liang (2018) assume that the data from different classes are sampled from small balls separated by a constant margin, and Allen-Zhu et al. (2018a) require that any two data points are well separated, even if they belong to the same class.

  • We show that with Gaussian random initialization on each layer, when the number of hidden nodes per layer is at least \(\widetilde{\Omega }\big (n^{14}L^{16}/\phi ^{4}\big )\), GD can achieve zero training error within \({\widetilde{O}}\big (n^5L^3/\phi \big )\) iterations, where \(\phi \) is the data separation distance, n is the number of training examples, and L is the number of hidden layers. This significantly improves the state-of-the-art results by Allen-Zhu et al. (2018a), where the authors proved that GD can converge within \({\widetilde{O}}\big (n^6L^2/\phi ^2\big )\) iterations if the number of hidden nodes per layer is at least \(\widetilde{\Omega }(n^{24}L^{12}/\phi ^8)\). Compared with Du et al. (2018a), our result has only a polynomial dependency on the number of hidden layers, which is much better than their result, which has an exponential dependency on the depth for fully connected deep neural networks.

2 Additional related work

Due to the vast literature on deep learning theory, we are not able to review all related papers here. Instead, we discuss the following two additional lines of research, which are also related to our work.

One-hidden-layer neural networks with ground truth parameters Recently a series of work (Tian 2017; Brutzkus and Globerson 2017; Li and Yuan 2017; Du et al. 2017; Zhang et al. 2018) studied a specific class of shallow two-layer (one-hidden-layer) neural networks, whose training data are generated by a ground truth network called the “teacher network”. This line of work aims to provide recovery guarantees for gradient-based methods to learn the teacher network based on either the population or the empirical loss function. More specifically, Tian (2017) proved that for two-layer ReLU networks with only one hidden neuron, GD with arbitrary initialization on the population loss is able to recover the hidden teacher network. Brutzkus and Globerson (2017) proved that GD can learn the true parameters of a two-layer network with a convolution filter. Li and Yuan (2017) proved that SGD can recover the underlying parameters of a two-layer residual network in polynomial time. Moreover, Du et al. (2017) proved that both GD and SGD can recover the teacher network of a two-layer CNN with ReLU activation function. Zhang et al. (2018) showed that GD on the empirical loss function can recover the ground truth parameters of one-hidden-layer ReLU networks at a linear rate.

Deep linear networks Beyond shallow one-hidden-layer neural networks, a series of recent work (Hardt and Ma 2016; Kawaguchi 2016; Bartlett et al. 2018; Gunasekar et al. 2018; Arora et al. 2018a, b) focused on the optimization landscape of deep linear networks. More specifically, Hardt and Ma (2016) showed that deep linear residual networks have no spurious local minima. Kawaguchi (2016) proved that all local minima are global minima in deep linear networks. Arora et al. (2018b) showed that depth can accelerate the optimization of deep linear networks. Bartlett et al. (2018) proved that with identity initialization and a proper regularizer, GD can converge to the least squares solution on a residual linear network with quadratic loss function, while Arora et al. (2018a) proved the same properties for general deep linear networks.

3 Preliminaries

3.1 Notation

We use lower case, lower case bold face, and upper case bold face letters to denote scalars, vectors and matrices respectively. For a positive integer n, we denote \([n] = \{1,\dots ,n\}\). For a vector \(\mathbf{x}= (x_1,\dots ,x_d)^\top \), we denote by \(\Vert \mathbf{x}\Vert _p=\big (\sum _{i=1}^d |x_i|^p\big )^{1/p}\) the \(\ell _p\) norm of \(\mathbf{x}\), \(\Vert \mathbf{x}\Vert _\infty = \max _{i=1,\dots ,d} |x_i|\) the \(\ell _\infty \) norm of \(\mathbf{x}\), and \(\Vert \mathbf{x}\Vert _0 = |\{x_i:x_i\ne 0,i=1,\dots ,d\}|\) the number of non-zero entries of \(\mathbf{x}\). We use \(\text {Diag}(\mathbf{x})\) to denote a square diagonal matrix with the elements of vector \(\mathbf{x}\) on the main diagonal. For a matrix \({\mathbf{A}}= (A_{ij})\in {\mathbb {R}}^{m\times n}\), we use \(\Vert {\mathbf{A}}\Vert _F\) to denote the Frobenius norm of \({\mathbf{A}}\), \(\Vert {\mathbf{A}}\Vert _2\) to denote the spectral norm (maximum singular value), and \(\Vert {\mathbf{A}}\Vert _0\) to denote the number of nonzero entries. We denote by \(S^{d-1} = \{ \mathbf{x}\in {\mathbb {R}}^d:\Vert \mathbf{x}\Vert _2 =1\}\) the unit sphere in \({\mathbb {R}}^d\).

For two sequences \(\{a_n\}\) and \(\{b_n\}\), we use \(a_n = O(b_n)\) to denote that \(a_n\le C_1 b_n\) for some absolute constant \(C_1> 0\), and use \(a_n = \Omega (b_n)\) to denote that \(a_n\ge C_2 b_n\) for some absolute constant \(C_2>0\). In addition, we also use \(\widetilde{O}(\cdot )\) and \(\widetilde{\Omega }(\cdot )\) to hide logarithmic terms in Big-O and Big-Omega notations. We also use the following matrix product notation. For indices \(l_1,l_2\) and a collection of matrices \(\{{\mathbf{A}}_r\}_{r\in {\mathbb {Z}}_+}\), we denote

$$\begin{aligned} \prod _{r = l_1}^{l_2} {\mathbf{A}}_r :=\left\{ \begin{array}{ll} {\mathbf{A}}_{l_2}{\mathbf{A}}_{l_2-1} \cdots {\mathbf{A}}_{l_1} &{} \text {if }l_1\le l_2 \\ {\mathbf{I}}&{} \text {otherwise.} \end{array} \right. \end{aligned}$$
(3.1)
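For readers who prefer to see this convention operationally, here is a minimal NumPy sketch of the product in (3.1); the helper name mat_prod and the explicit dim argument for the empty product are our own choices for illustration.

```python
import numpy as np

def mat_prod(mats, l1, l2, dim):
    """Return A_{l2} A_{l2-1} ... A_{l1} as in (3.1); the dim x dim identity if l1 > l2.

    `mats` is a dict mapping the index r to the matrix A_r.
    """
    if l1 > l2:
        return np.eye(dim)       # empty product is the identity
    out = mats[l1]
    for r in range(l1 + 1, l2 + 1):
        out = mats[r] @ out      # left-multiply, so the final order is A_{l2} ... A_{l1}
    return out
```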

3.2 Problem setup

Let \(\{(\mathbf{x}_1,y_1),\ldots ,(\mathbf{x}_n,y_n)\} \in ({\mathbb {R}}^d\times \{-1,1\})^n\) be a set of n training examples. Let \(m_0 = d\). We consider L-hidden-layer neural networks as follows:

$$\begin{aligned} f_{{\mathbf{W}}}(\mathbf{x}) = {\mathbf{v}}^\top \sigma ( {\mathbf{W}}_{L}^{\top } \sigma ( {\mathbf{W}}_{L-1}^{\top } \cdots \sigma ( {\mathbf{W}}_{1}^{\top } \mathbf{x})\cdots )), \end{aligned}$$

where \(\sigma (x) = \max \{0, x\}\) is the entry-wise ReLU activation function, \({\mathbf{W}}_{l} = ({\mathbf{w}}_{l,1},\ldots ,{\mathbf{w}}_{l,m_l}) \in {\mathbb {R}}^{m_{l-1}\times m_{l}}\), \(l=1,\ldots ,L\) are the weight matrices, and \({\mathbf{v}}\in \{-1,+1\}^{m_L}\) is the fixed output layer weight vector with half of its entries equal to 1 and half equal to \(-1\). Let \({\mathbf{W}}=\{{\mathbf{W}}_l\}_{l=1,\dots ,L}\) denote the collection of matrices \({\mathbf{W}}_1,\dots ,{\mathbf{W}}_L\). We consider solving the following empirical risk minimization problem:

$$\begin{aligned} L_S({\mathbf{W}}) = \frac{1}{n}\sum _{i=1}^n\ell ( y_i{\widehat{y}}_i) = \frac{1}{n}\sum _{i=1}^n\ell \big ( y_i{\mathbf{v}}^\top \sigma ( {\mathbf{W}}_{L}^{\top } \sigma ( {\mathbf{W}}_{L-1}^{\top } \cdots \sigma ( {\mathbf{W}}_{1}^{\top } \mathbf{x}_i )\cdots ))\big ) \end{aligned}$$
(3.2)

where \({\widehat{y}}_i = f_{{\mathbf{W}}}(\mathbf{x}_i)\) denotes the output of the neural network and \(\ell ( x) = \log (1 + \exp (-x))\) is the cross-entropy loss for binary classification.
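To make the setup concrete, the following is a minimal NumPy sketch of the forward pass \(f_{{\mathbf{W}}}(\mathbf{x})\) and the empirical loss (3.2); the function names forward and empirical_loss are ours, and this sketch is only an illustration of the definitions, not the implementation used in our experiments.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(W_list, v, x):
    """f_W(x) = v^T sigma(W_L^T sigma(W_{L-1}^T ... sigma(W_1^T x)...)) for one input x."""
    h = x
    for W in W_list:                 # W_l has shape (m_{l-1}, m_l)
        h = relu(W.T @ h)
    return float(v @ h)

def empirical_loss(W_list, v, X, y):
    """L_S(W) = (1/n) sum_i log(1 + exp(-y_i f_W(x_i))), the cross-entropy loss in (3.2)."""
    margins = np.array([yi * forward(W_list, v, xi) for xi, yi in zip(X, y)])
    return float(np.mean(np.log1p(np.exp(-margins))))
```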

3.3 Optimization algorithms

In this paper, we consider training a deep neural network with Gaussian initialization followed by gradient descent.

Gaussian initialization We say that the weight matrices \({\mathbf{W}}_{1}, \ldots , {\mathbf{W}}_{L}\) are generated from Gaussian initialization if each column of \({\mathbf{W}}_{l}\) is generated independently from the Gaussian distribution \(N(\mathbf{0},2/m_l {\mathbf{I}})\) for all \(l=1,\ldots ,L\). This initialization mechanism, known as He initialization, was proposed in He et al. (2015).
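The following sketch shows one way to generate such an initialization, together with the fixed output vector \({\mathbf{v}}\) with half 1 and half \(-1\) entries; the function name gaussian_init and the assumption that \(m_L\) is even are ours.

```python
import numpy as np

def gaussian_init(d, widths, seed=0):
    """Draw W_1, ..., W_L so that each column of W_l is an independent N(0, (2/m_l) I)
    vector, and build the fixed output vector v (m_L assumed even)."""
    rng = np.random.default_rng(seed)
    dims = [d] + list(widths)                      # m_0 = d, then m_1, ..., m_L
    W_list = []
    for l in range(1, len(dims)):
        m_prev, m_l = dims[l - 1], dims[l]
        # shape (m_{l-1}, m_l); entry-wise N(0, 2/m_l) gives columns ~ N(0, (2/m_l) I)
        W_list.append(rng.normal(0.0, np.sqrt(2.0 / m_l), size=(m_prev, m_l)))
    v = np.concatenate([np.ones(dims[-1] // 2), -np.ones(dims[-1] // 2)])
    return W_list, v
```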

Gradient descent We consider solving the empirical risk minimization problem (3.2) by gradient descent starting from Gaussian initialization: letting \( {\mathbf{W}}_{1}^{(0)},\ldots ,{\mathbf{W}}_{L}^{(0)} \) be the weight matrices generated from Gaussian initialization, we use the following gradient descent update rule:

$$\begin{aligned} {\mathbf{W}}_{l}^{(k)} = {\mathbf{W}}_{l}^{(k-1)} - \eta \nabla _{{\mathbf{W}}_l} L_{S}({\mathbf{W}}^{(k-1)}),~l=1,\ldots ,L, \end{aligned}$$

where \(\nabla _{{\mathbf{W}}_l} L_{S}(\cdot )\) is the partial gradient of \(L_{S}(\cdot )\) with respect to the l-th layer parameters \({\mathbf{W}}_l\), and \(\eta >0\) is the step size (a.k.a., learning rate).
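A minimal sketch of this iteration is given below. For brevity the gradient is approximated here by central finite differences rather than the closed-form expressions of Sect. 3.4, so this is only an illustration of the update rule, not an efficient implementation; all function names are ours.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def loss(W_list, v, X, y):
    """Empirical cross-entropy loss L_S(W) as in (3.2)."""
    outs = []
    for xi in X:
        h = xi
        for W in W_list:
            h = relu(W.T @ h)
        outs.append(v @ h)
    return float(np.mean(np.log1p(np.exp(-np.asarray(y) * np.asarray(outs)))))

def numerical_grad(W_list, v, X, y, eps=1e-5):
    """Finite-difference stand-in for the partial gradients grad_{W_l} L_S(W)."""
    grads = []
    for W in W_list:
        G = np.zeros_like(W)
        for idx in np.ndindex(*W.shape):
            W[idx] += eps
            f_plus = loss(W_list, v, X, y)
            W[idx] -= 2 * eps
            f_minus = loss(W_list, v, X, y)
            W[idx] += eps
            G[idx] = (f_plus - f_minus) / (2 * eps)
        grads.append(G)
    return grads

def gradient_descent(W_list, v, X, y, eta=1e-2, K=100):
    """W_l^{(k)} = W_l^{(k-1)} - eta * grad_{W_l} L_S(W^{(k-1)}) for l = 1, ..., L."""
    for _ in range(K):
        grads = numerical_grad(W_list, v, X, y)
        W_list = [W - eta * G for W, G in zip(W_list, grads)]
    return W_list
```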

3.4 Calculations for neural network functions

Here we briefly introduce some useful notation and provide basic calculations regarding the neural network in our setting.

  • Output after the l-th layer: Given an input \(\mathbf{x}_i\), the output of the neural network after the l-th layer is

    $$\begin{aligned} \mathbf{x}_{l,i}&= \sigma ( {\mathbf{W}}_{l}^{\top } \sigma ( {\mathbf{W}}_{l-1}^{\top } \cdots \sigma ( {\mathbf{W}}_{1}^{\top } \mathbf{x}_i )\cdots ))\\&=\left( \prod _{r=1}^l\varvec{\Sigma }_{r,i}{\mathbf{W}}_{r}^{\top }\right) \mathbf{x}_i, \end{aligned}$$

    where \(\varvec{\Sigma }_{1,i} = \mathrm {Diag}\big ( {{\,\mathrm{\mathbb {1}}\,}}\{ {\mathbf{W}}_1^{\top } \mathbf{x}_i > 0 \} \big )\), and \(\varvec{\Sigma }_{l,i} = \mathrm {Diag}[{{\,\mathrm{\mathbb {1}}\,}}\{ {\mathbf{W}}_{l}^{\top } (\prod _{r=1}^{l-1}\varvec{\Sigma }_{r,i}{\mathbf{W}}_{r}^{\top }) \mathbf{x}_i > 0 \} ]\) for \(l=2,\ldots ,L\).

  • Output of the neural network: The output of the neural network with input \(\mathbf{x}_i\) is as follows:

    $$\begin{aligned} f_{{\mathbf{W}}}(\mathbf{x}_i)&= {\mathbf{v}}^\top \sigma ( {\mathbf{W}}_{L}^{\top } \sigma ( {\mathbf{W}}_{L-1}^{\top } \cdots \sigma ( {\mathbf{W}}_{1}^{\top } \mathbf{x}_i )\cdots ))\\&= {\mathbf{v}}^\top \left( \prod _{r=l}^L\varvec{\Sigma }_{r,i}{\mathbf{W}}_{r}^{\top }\right) \mathbf{x}_{l-1,i}, \end{aligned}$$

    where we define \(\mathbf{x}_{0,i} = \mathbf{x}_i\) and the last equality holds for any \(l\ge 1\).

  • Gradient of the neural network: The partial gradient of the training loss \(L_S({\mathbf{W}})\) with respect to \({\mathbf{W}}_l\) is as follows:

    $$\begin{aligned} \nabla _{{\mathbf{W}}_l}L_S({\mathbf{W}}) = \frac{1}{n}\sum _{i=1}^n\ell '(y_i{\widehat{y}}_i)\cdot y_i\cdot \nabla _{{\mathbf{W}}_l}[f_{\mathbf{W}}(\mathbf{x}_i)], \end{aligned}$$

    where the gradient of the neural network function is defined as

    $$\begin{aligned} \nabla _{{\mathbf{W}}_l}[ f_{{\mathbf{W}}}(\mathbf{x}_i)] =\mathbf{x}_{l-1,i}{\mathbf{v}}^\top \left( \prod _{r=l+1}^L\varvec{\Sigma }_{r,i}{\mathbf{W}}_{r}^\top \right) \varvec{\Sigma }_{l,i}. \end{aligned}$$

    In the remainder of this paper, we define the gradient \(\nabla L_S({\mathbf{W}})\) as the collection of partial gradients with respect to all \({\mathbf{W}}_l\)’s, i.e.,

    $$\begin{aligned} \nabla L_S({\mathbf{W}}) = \{\nabla _{{\mathbf{W}}_1} L_S({\mathbf{W}}),\nabla _{{\mathbf{W}}_2} L_S({\mathbf{W}}),\ldots ,\nabla _{{\mathbf{W}}_L} L_S({\mathbf{W}})\}. \end{aligned}$$

    We also define the Frobenius norm of \(\nabla L_S({\mathbf{W}})\) as

    $$\begin{aligned} \Vert \nabla L_S({\mathbf{W}}) \Vert _F = \left[ \sum _{l=1}^L\Vert \nabla _{{\mathbf{W}}_l} L_S({\mathbf{W}}) \Vert _F^2 \right] ^{1/2}. \end{aligned}$$
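The following NumPy sketch implements the forward pass with the diagonal matrices \(\varvec{\Sigma }_{l,i}\) (stored as 0/1 vectors) together with the gradient formulas above. It recomputes the backward product for every layer for clarity rather than efficiency, and all function names are our own.

```python
import numpy as np

def ell_prime(z):
    """Derivative of ell(z) = log(1 + exp(-z)), i.e. ell'(z) = -1 / (1 + exp(z))."""
    return -1.0 / (1.0 + np.exp(z))

def forward_with_patterns(W_list, v, x):
    """Return f_W(x), the layer outputs x_0, ..., x_L and the activation patterns
    diag(Sigma_l) (as 0/1 vectors) for one input x."""
    xs, sigmas, h = [x], [], x
    for W in W_list:
        pre = W.T @ h                  # shape (m_l,)
        s = (pre > 0).astype(float)
        sigmas.append(s)
        h = s * pre                    # sigma(pre) = Sigma_l W_l^T x_{l-1}
        xs.append(h)
    return float(v @ h), xs, sigmas

def grad_f(W_list, v, xs, sigmas, l):
    """grad_{W_l} f_W(x) = x_{l-1} v^T (prod_{r=l+1}^{L} Sigma_r W_r^T) Sigma_l,
    with l = 1, ..., L as in the text."""
    L = len(W_list)
    u = v.copy()                       # accumulates v^T Sigma_L W_L^T ... Sigma_{r} W_r^T
    for r in range(L, l, -1):          # r = L, L-1, ..., l+1
        u = (u * sigmas[r - 1]) @ W_list[r - 1].T
    u = u * sigmas[l - 1]              # right-multiply by Sigma_l
    return np.outer(xs[l - 1], u)      # shape (m_{l-1}, m_l), same as W_l

def grad_LS(W_list, v, X, y):
    """grad_{W_l} L_S(W) = (1/n) sum_i ell'(y_i yhat_i) * y_i * grad_{W_l} f_W(x_i)."""
    n, L = len(X), len(W_list)
    grads = [np.zeros_like(W) for W in W_list]
    for xi, yi in zip(X, y):
        yhat, xs, sigmas = forward_with_patterns(W_list, v, xi)
        coef = ell_prime(yi * yhat) * yi / n
        for l in range(1, L + 1):
            grads[l - 1] += coef * grad_f(W_list, v, xs, sigmas, l)
    return grads
```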

4 Main theory

In this section, we show that with random Gaussian initialization, over-parameterization helps gradient descent converge to the global minimum, i.e., find a point in the parameter space with arbitrarily small training loss. We start with the following assumptions on the training data.

Assumption 4.1

\(\Vert \mathbf{x}_i \Vert _2 = 1\) and \((\mathbf{x}_i)_d = \mu \) for all \(i\in \{1,\ldots ,n\}\), where \(\mu \in ( 0, 1)\) is a constant.

As stated in the assumption above, the last entry of each input \(\mathbf{x}\) is assumed to be a constant \(\mu \). This assumption is natural because it can be seen as adding a bias term in the input layer: learning both the weight vector and the bias is equivalent to adding an additional dummy variable (\((\mathbf{x}_i)_d = \mu \)) to all input vectors and learning the weight vector only. The same assumption has been made in Allen-Zhu et al. (2018a). In addition, we emphasize that Assumption 4.1 is made in order to simplify the proof. In fact, rather than restricting the norm of all training examples to be 1, this assumption can be relaxed so that \(\Vert \mathbf{x}_i\Vert _2\) is lower and upper bounded by some constants.
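As an illustration of this reduction, the sketch below appends the dummy coordinate and rescales each raw input so that Assumption 4.1 holds exactly; the function name preprocess and the default value of \(\mu \) are our own choices, and the sketch assumes no input is the all-zero vector.

```python
import numpy as np

def preprocess(Z, mu=0.5):
    """Map each raw input z in R^{d-1} to x = (sqrt(1 - mu^2) * z / ||z||_2, mu),
    so that ||x||_2 = 1 and the last entry equals mu (Assumption 4.1)."""
    Z = np.asarray(Z, dtype=float)
    norms = np.linalg.norm(Z, axis=1, keepdims=True)   # assumed nonzero
    scaled = np.sqrt(1.0 - mu ** 2) * Z / norms
    return np.hstack([scaled, np.full((Z.shape[0], 1), mu)])
```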

Assumption 4.2

For all \(i,i'\in \{1,\ldots ,n\}\), if \(y_i \ne y_{i'}\), then \(\Vert \mathbf{x}_i - \mathbf{x}_{i'} \Vert _2 \ge \phi \) for some \(\phi >0\).

Assumption 4.2 basically requires that inputs with different labels in the training data are separated from each other by at least a constant. This assumption is often satisfied in practice. In contrast, Allen-Zhu et al. (2018a) assume that every two distinct data points in the training data are separated by a constant, which is much stronger and can be violated in practice, since in classification data points with the same label may be arbitrarily close.

Furthermore, Assumption 4.2 can be easily verified on the training data. As a comparison, Du et al. (2018a) assume that a certain deep compositional Gram matrix defined on the training data is strictly positive definite, which is not easy to verify, since the definition of this Gram matrix involves integration.
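As an example, the following sketch computes the smallest distance between training points from different classes, which is a valid choice of \(\phi \) in Assumption 4.2; the function name separation is ours, and the dense pairwise computation is only intended for moderate sample sizes.

```python
import numpy as np

def separation(X, y):
    """Smallest ||x_i - x_j||_2 over pairs with y_i != y_j (labels in {-1, +1})."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    pos, neg = X[y == 1], X[y == -1]
    diffs = pos[:, None, :] - neg[None, :, :]          # all cross-class differences
    return float(np.sqrt((diffs ** 2).sum(axis=-1)).min())
```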

Then we have the following assumption on the structure of neural network.

Assumption 4.3

Define \(M = \max \{ m_1,\ldots , m_L\}\), \(m = \min \{ m_1,\ldots , m_L\}\). We assume that \(M \le 2m\).

Assumption 4.3 states that the numbers of nodes at all layers are of the same order. The constant 2 is not essential and can be replaced with an arbitrary constant greater than or equal to 1.

Under Assumptions 4.1–4.3, we are able to establish the global convergence of gradient descent for training deep ReLU networks. Specifically, we provide the following theorem, which characterizes the required number of hidden nodes and number of iterations such that gradient descent can attain the global minimum of the training loss function.

Theorem 4.4

Suppose \({\mathbf{W}}_1^{(0)}, \dots , {\mathbf{W}}_L^{(0)}\) are generated by Gaussian initialization. Then under Assumptions 4.1–4.3, if the step size \(\eta = O(M^{-1}L^{-3})\), the number of hidden nodes per layer satisfies

$$\begin{aligned} m=\widetilde{\Omega }\big (n^{14}L^{16}\phi ^{-4} + n^{12}L^{16}\phi ^{-4}\epsilon ^{-1}\big ) \end{aligned}$$

and the maximum number of iterations satisfies

$$\begin{aligned} K= {\widetilde{O}}\big (n^5L^3/\phi +n^{3}L^3\epsilon ^{-1}/\phi \big ), \end{aligned}$$

then with high probability, the last iterate of gradient descent \({\mathbf{W}}^{(K)}\) satisfies \(L_S({\mathbf{W}}^{(K)})\le \epsilon \).

Remark 4.1

Note that our bound on the required number of hidden nodes per layer, i.e., m, depends on the target accuracy \(\epsilon \). However, in practical classification tasks, we are more interested in finding a point with zero training error. Specifically, the cross-entropy loss \(\ell (x) = \log (1 + \exp (-x))\) is strictly decreasing in x, thus \(\ell (y_i{\widehat{y}}_i)\le \ell (0) = \log (2) \) implies \(y_i{\widehat{y}}_i \ge 0\). If we set \(L_S({\mathbf{W}})\le \ell (0)/n = \log (2)/n\), it holds that \(\ell (y_i{\widehat{y}}_i)\le nL_S({\mathbf{W}})\le \ell (0)\) for all \(i\in [n]\), which further implies that \(y_i{\widehat{y}}_i \ge 0\) for all \(i\in [n]\), i.e., all training data are correctly classified. Therefore, Theorem 4.4 implies that gradient descent can find a point with zero training error if the number of hidden nodes per layer is at least \(m=\widetilde{\Omega }(n^{14}L^{16}\phi ^{-4})\).

Remark 4.2

Here we compare our theoretical results with those in Allen-Zhu et al. (2018a) and Du et al. (2018a). Specifically, Allen-Zhu et al. (2018a) proved that gradient descent can achieve zero training error within \(O(n^6L^2/\phi ^2)\) iterations under the condition that the neural network width is at least \(m = \widetilde{\Omega }(n^{24}L^{12}/\phi ^{8})\). As a clear comparison, our result on m is significantly better by a factor of \(\widetilde{\Omega }(n^{10}L^{-4}/\phi ^4)\), and our convergence rate is faster by a factor of \(O(nL^{-1})\). On the other hand, Du et al. (2018a) proved a similar global convergence result when the neural network width is at least \(\widetilde{\Omega }\big (2^{O(L)}\cdot n^4/\lambda _0^4\big )\), where \(\lambda _0\) is the smallest eigenvalue of the deep compositional Gram matrix defined in their paper. Compared with their result, our condition on m has a significantly better dependency on L. In addition, for real training data, \(\lambda _0\) can have a high-degree dependency on the reciprocal of the sample size n, which makes the dependency of their result on n much worse.

5 Proof of the main theory

In this section, we provide the proof of the main theory. Specifically, we decompose the proof into three steps:

Step 1: We characterize a perturbation region around the initialization, and prove that the neural network attains good properties within this region.

Step 2: Assuming that all iterates stay inside the region \({\mathcal {B}}({\mathbf{W}}^{(0)},\tau )\), we establish the convergence result of gradient descent.

Step 3: We verify that, with our choice of m, all iterates of gradient descent stay inside the perturbation region \({\mathcal {B}}({\mathbf{W}}^{(0)},\tau )\) until convergence, which justifies the derived convergence guarantee.

Now we characterize the perturbation region as follows. Given the Gaussian initialization \({\mathbf{W}}^{(0)}: = \{{\mathbf{W}}_l^{(0)}\}_{l = 1,\dots , L}\), we denote by \({\mathcal {B}}({\mathbf{W}}^{(0)},\tau ) = \{{\mathbf{W}}: \Vert {\mathbf{W}}_l - {\mathbf{W}}_l^{(0)}\Vert _2\le \tau \text{ for } \text{ all } l\in [L]\}\) the perturbation region centered at \({\mathbf{W}}^{(0)}\). Then we provide the following lemmas, which give key results that are essential to establishing the convergence guarantee for (stochastic) gradient descent.
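For intuition, membership in \({\mathcal {B}}({\mathbf{W}}^{(0)},\tau )\) can be checked numerically by computing the layer-wise spectral-norm distances to the initialization, as in the sketch below; the function name is ours.

```python
import numpy as np

def in_perturbation_region(W_list, W0_list, tau):
    """Check whether ||W_l - W_l^{(0)}||_2 <= tau (spectral norm) for every layer l."""
    dists = [np.linalg.norm(W - W0, ord=2) for W, W0 in zip(W_list, W0_list)]
    return max(dists) <= tau, dists
```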

Lemma 5.1

(Bounded initial training loss) Under Assumptions 4.1 and 4.3, with probability at least \(1-\delta \), at the initialization the training loss satisfies \(L_S({\mathbf{W}}^{(0)})\le C\sqrt{\log (n/\delta )}\).

Next we state the key lemmas that characterize some essential properties of the neural network when its weight parameters satisfy \({\mathbf{W}}\in {\mathcal {B}}({\mathbf{W}}^{(0)},\tau )\). First, the following lemma provides lower and upper bounds on the Frobenius norm of the partial gradient \(\nabla _{{\mathbf{W}}_l} [L_S({\mathbf{W}})]\).

Lemma 5.2

(Gradient lower and upper bound) Under Assumptions 4.1, 4.2, and 4.3, if \(\tau = O\big (\phi ^{3/2}n^{-3}L^{-2}\big ) \) and \(m = \widetilde{\Omega }(n^2\phi ^{-1})\), then for all \(\widetilde{{\mathbf{W}}} \in {\mathcal {B}}({\mathbf{W}}^{(0)},\tau )\), with probability at least \(1- \exp \big (-O(m\phi /n)\big )\), there exist positive constants C and \(C'\) such that

$$\begin{aligned} \Vert \nabla _{{\mathbf{W}}_L}[L_S(\widetilde{{\mathbf{W}}})]\Vert _F^2&\ge C \frac{m\phi }{n^5}\left( \sum _{i=1}^n\ell '(y_i \widetilde{y}_i)\right) ^2,\nonumber \\ \Vert \nabla _{{\mathbf{W}}_l}[L_S(\widetilde{{\mathbf{W}}})]\Vert _F&\le -\frac{C'L M^{1/2}}{n}\sum _{i=1}^n\ell '(y_i {\widetilde{y}}_i), \end{aligned}$$

for all \(l\in [L]\), where \( {\widetilde{y}}_i = f_{ \widetilde{{\mathbf{W}}}}(\mathbf{x}_i)\).

Then we provide the following lemma, which characterizes the decrease of the training loss after one gradient descent step.

Lemma 5.3

(Sufficient descent) Let \({\mathbf{W}}_1^{(0)},\dots ,{\mathbf{W}}_L^{(0)}\) be generated via Gaussian random initialization. Let \({\mathbf{W}}^{(k)} = \{{\mathbf{W}}_l^{(k)}\}_{l=1,\dots , L}\) be the k-th iterate of gradient descent and \(\tau = O(L^{-11}\log ^{-3/2}(M))\). If \({\mathbf{W}}^{(k)}, {\mathbf{W}}^{(k+1)}\in {\mathcal {B}}({\mathbf{W}}^{(0)},\tau )\), then there exist constants \(C'\) and \(C''\) such that with probability at least \(1-\exp \big (-O(m\phi /n)\big )\) the following holds:

$$\begin{aligned} L_S({\mathbf{W}}^{(k+1)}) - L_S({\mathbf{W}}^{(k)})&\le - \big (\eta - C'ML^3\eta ^2\big )\Vert \nabla L_S({{\mathbf{W}}^{(k)}})\Vert _F^2 \\&-\frac{ C''L^{8/3}\tau ^{1/3}\sqrt{M\log (M)}\cdot \eta \Vert \nabla L_S({\mathbf{W}}^{(k)})\Vert _F}{n} \sum _{i=1}^n\ell '(y_i{\widehat{y}}_i^{(k)}) \end{aligned}$$

The second term on the R.H.S. of the result in Lemma 5.3 is due to the non-smoothness of the ReLU activation, which can be characterized by counting how many nodes change their activation patterns during the training process. Clearly, in order to guarantee that gradient descent achieves sufficient descent in each step, we require the radius \(\tau \) to be sufficiently small. In the following, we complete the proof of Theorem 4.4 based on Lemmas 5.1–5.3.

Proof of Theorem 4.4

We first prove that GD is able to achieve \(\epsilon \) training loss under the condition that all iterates stay inside the perturbation region \({\mathcal {B}}({\mathbf{W}}^{(0)},\tau )\). Note that by Lemma 5.2, we know that there exists a constant \(c_0\) such that

$$\begin{aligned} {\Vert \nabla L_S({\mathbf{W}}^{(k)})\Vert _F^2}\ge \Vert \nabla _{{\mathbf{W}}_L}[L_S({\mathbf{W}}^{(k)})]\Vert _F^2\ge \frac{c_0 m\phi }{n^5}\left( \sum _{i=1}^n\ell '(y_i{\widehat{y}}_i^{(k)})\right) ^2. \end{aligned}$$

We set the radius \(\tau \) and the step size \(\eta \) as follows,

$$\begin{aligned} \tau&{=} \left( \frac{c_0^{1/2}m^{1/2}\phi ^{1/2}}{4C''L^{8/3}n^{3/2}\sqrt{M\log (M)}}\right) ^3 = {\widetilde{O}}(n^{-9/2}L^{-8}\phi ^{3/2}), \nonumber \\ \eta&{=} \frac{1}{4C'ML^3} = O(M^{-1}L^{-3}). \end{aligned}$$

Then we have

$$\begin{aligned}&L_S({\mathbf{W}}^{(k+1)})-L_S({\mathbf{W}}^{(k)})\nonumber \\&\quad \le -\frac{3\eta }{4}\Vert \nabla L_S({\mathbf{W}}^{(k)})\Vert _F^2 - \frac{c_0\eta m^{1/2}\phi ^{1/2}}{4n^{5/2}}\Vert \nabla L_S({\mathbf{W}}^{(k)})\Vert _F\cdot \sum _{i=1}^n \ell '(y_i{\widehat{y}}_i^{(k)})\nonumber \\&\quad \le -\frac{\eta }{2}\Vert \nabla L_S({\mathbf{W}}^{(k)})\Vert _F^2\nonumber \\&\quad \le -\eta \frac{c_0m\phi }{2n^5} \left( \sum _{i=1}^n\ell '(y_i{\widehat{y}}_i^{(k)})\right) ^2, \end{aligned}$$
(5.1)

where the first inequality is by Lemma 5.3 and the choices of \(\eta \) and \(\tau \), the second inequality follows from Lemma 5.2, and the last inequality is due to the gradient lower bound we derived above. Note that \(\ell (x) = \log (1+\exp (-x))\), which satisfies \(-\ell '(x) = 1/(1+\exp (x))\ge \min \big \{\alpha _0,\alpha _1\ell (x)\big \}\) where \(\alpha _0 = 1/2\) and \(\alpha _1 = 1/(2\log (2))\). This implies that

$$\begin{aligned} -\sum _{i=1}^n\ell '(y_i{\widehat{y}}_i^{(k)})\ge \min \bigg \{\alpha _0, \sum _{i=1}^n \alpha _1\ell (y_i{\widehat{y}}_i^{(k)})\bigg \}\ge \min \big \{\alpha _0, n\alpha _1L_S({\mathbf{W}}^{(k)})\big \}. \end{aligned}$$

Since \(\min \{a,b\}\ge 1/(1/a+1/b)\), plugging the above inequality into (5.1) yields

$$\begin{aligned} L_S({\mathbf{W}}^{(k+1)}) - L_S({\mathbf{W}}^{(k)})&\le -\eta \min \bigg \{\frac{c_0m\phi \alpha _0^2}{2n^5},\frac{c_0m\phi \alpha _1^2}{2n^{3}}L_S^2({\mathbf{W}}^{(k)})\bigg \}\nonumber \\&\le -\eta \bigg (\frac{2n^5}{c_0m\phi \alpha _0^2}+\frac{2n^3}{c_0m\phi \alpha _1^2L_S^2({\mathbf{W}}^{(k)})}\bigg )^{-1}. \end{aligned}$$

Rearranging terms gives

$$\begin{aligned} \frac{2n^5}{c_0m\phi \alpha _0^2}\big (L_S({\mathbf{W}}^{(k+1)}) - L_S({\mathbf{W}}^{(k)})\big )+\frac{2n^{3}\big (L_S({\mathbf{W}}^{(k+1)}) -L_S({\mathbf{W}}^{(k)})\big )}{c_0m\phi \alpha _1^2L_S^{2}({\mathbf{W}}^{(k)})}\le -\eta . \end{aligned}$$
(5.2)

Applying the inequality \((x-y)/y^2\ge y^{-1}-x^{-1}\) and taking the telescoping sum over k gives

$$\begin{aligned} k\eta&\le \frac{2n^5}{c_0m\phi \alpha _0^2}\big (L_S({\mathbf{W}}^{(0)})-L_S({\mathbf{W}}^{(k)})\big )+\frac{2n^{3}\big (L_S^{-1}({\mathbf{W}}^{(k)}) -L_S^{-1}({\mathbf{W}}^{(0)})\big )}{c_0m\phi \alpha _1^2}\nonumber \\&\le \frac{2n^5}{c_0m\phi \alpha _0^2}L_S({\mathbf{W}}^{(0)})+\frac{2n^{3}\big (L_S^{-1}({\mathbf{W}}^{(k)}) -L_S^{-1}({\mathbf{W}}^{(0)})\big )}{c_0m\phi \alpha _1^2}. \end{aligned}$$
(5.3)

Now we need to guarantee that after K gradient descent steps the loss function \(L_S({\mathbf{W}}^{(K)})\) is smaller than the target accuracy \(\epsilon \). By Lemma 5.1, we know that the initial training loss satisfies \(L_S({\mathbf{W}}^{(0)}) = {\widetilde{O}}(1)\). Therefore, by (5.3) and our choice of \(\eta \), the maximum iteration number K satisfies

$$\begin{aligned} K = {\widetilde{O}}\big (n^5L^3/\phi +n^{3}L^3\epsilon ^{-1}/\phi \big ). \end{aligned}$$
(5.4)

Next we verify the condition that all iterates stay inside the perturbation region \({\mathcal {B}}({\mathbf{W}}^{(0)},\tau )\). We prove this by induction. Clearly, \({\mathbf{W}}^{(0)}\in {\mathcal {B}}({\mathbf{W}}^{(0)},\tau )\). We then prove that \({\mathbf{W}}^{(k+1)}\in {\mathcal {B}}({\mathbf{W}}^{(0)},\tau )\) under the induction hypothesis that \({\mathbf{W}}^{(t)}\in {\mathcal {B}}({\mathbf{W}}^{(0)},\tau )\) holds for all \(t\le k\). According to (5.1), we have

$$\begin{aligned} L_S({\mathbf{W}}^{(t+1)}) -L_S({\mathbf{W}}^{(t)})\le -\frac{\eta }{2}\Vert \nabla L_S({\mathbf{W}}^{(t)})\Vert _F^2, \end{aligned}$$
(5.5)

for any \(t< k\). Therefore, by the triangle inequality, we have

$$\begin{aligned} \Vert {\mathbf{W}}_l^{(k)} - {\mathbf{W}}_l^{(0)}\Vert _2&\le \eta \sum _{t=0}^{k-1} \big \Vert \nabla _{{\mathbf{W}}_l}[L_S({\mathbf{W}}^{(t)})]\big \Vert _2\nonumber \\&\le \eta \sqrt{k\sum _{t=0}^{k-1}\big \Vert \nabla L_S({\mathbf{W}}^{(t)})\big \Vert _F^2}\nonumber \\&\le \sqrt{2k\eta \sum _{t=0}^{k-1}\big [{L_S({\mathbf{W}}^{(t)}) - L_S({\mathbf{W}}^{(t+1)})}\big ]}\nonumber \\&\le \sqrt{2k\eta L_S({\mathbf{W}}^{(0)})}. \end{aligned}$$

By Lemma 5.1, we know that \(L_S({\mathbf{W}}^{(0)}) = {\widetilde{O}}(1)\). Then applying our choices of \(\eta \) and K, we have

$$\begin{aligned} \Vert {\mathbf{W}}_l^{(k)}-{\mathbf{W}}_l^{(0)}\Vert _2\le \sqrt{2K\eta L_S({\mathbf{W}}^{(0)})} = \widetilde{O}\big (n^{5/2}\phi ^{-1/2}m^{-1/2}+n^{3/2}\epsilon ^{-1/2}\phi ^{-1/2}m^{-1/2}\big ). \end{aligned}$$

In addition, by Lemma 5.2 and our choice of \(\eta \), we have

$$\begin{aligned} \eta \Vert \nabla _{{\mathbf{W}}_l}[L_S({\mathbf{W}}^{(k)})]\Vert _2&\le -\frac{\eta C'LM^{1/2}}{n}\sum _{i=1}^n \ell '\big (y_i\cdot f_{{\mathbf{W}}^{(k)}}(\mathbf{x}_i)\big ) \nonumber \\&\le {\widetilde{O}}(L^{-2}M^{-1/2}), \end{aligned}$$

where the second inequality follows from the choice of \(\eta \) and the fact that \(-1\le \ell '(\cdot )\le 0 \). Then by the triangle inequality, we have

$$\begin{aligned} \Vert {\mathbf{W}}_l^{(k+1)}-{\mathbf{W}}_l^{(0)}\Vert _2&\le \eta \Vert \nabla _{{\mathbf{W}}_l}[L_S({\mathbf{W}}^{(k)})]\Vert _2 + \Vert {\mathbf{W}}_l^{(k)}-{\mathbf{W}}_l^{(0)}\Vert _2\nonumber \\&={\widetilde{O}}(n^{-9/2}L^{-8}\phi ^{3/2}), \end{aligned}$$

which is exactly of the same order as \(\tau \), where the last equality follows from the over-parameterization assumption \(m = \widetilde{\Omega }\big (n^{14}L^{16}\phi ^{-4}+n^{12}L^{16}\phi ^{-4}\epsilon ^{-1}\big )\). This verifies that \({\mathbf{W}}^{(k+1)}\in {\mathcal {B}}({\mathbf{W}}^{(0)},\tau )\) and completes the induction on k, which finishes the proof. \(\square \)

6 Experiments

In this section we carry out experiments on two real datasets (MNIST, LeCun et al. 1998, and CIFAR10, Krizhevsky 2009) to support our theory. Since we mainly focus on binary classification, we extract a subset with digits 3 and 8 from the original MNIST dataset, which consists of 9,943 training examples. In addition, we also extract two classes of images (“cat” and “ship”) from the original CIFAR10 dataset, which consists of 7,931 training examples. Regarding the neural network architecture, we use a fully-connected deep ReLU network with \(L =15\) hidden layers, where each layer has width m. The network architecture is consistent with the setting of our theory.
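For reference, the two-class subsets can be extracted as in the following sketch, which assumes the images and integer labels have already been loaded as NumPy arrays; the function name binary_subset is ours and this is not the exact script used in our experiments.

```python
import numpy as np

def binary_subset(images, labels, class_a, class_b):
    """Keep two classes (e.g. digits 3 and 8 for MNIST, or the "cat" and "ship"
    class indices for CIFAR10) and relabel them as +1 / -1."""
    mask = (labels == class_a) | (labels == class_b)
    X = images[mask].reshape(mask.sum(), -1).astype(float)   # flatten each image
    y = np.where(labels[mask] == class_a, 1, -1)
    return X, y
```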

Fig. 1 The convergence of GD for training deep ReLU networks with different network widths. a MNIST dataset. b CIFAR10 dataset

We first demonstrate that over-parameterization indeed helps optimization. We run GD for training deep ReLU networks with different network widths and plot the training loss in Fig. 1, where we apply the cross-entropy loss on both the MNIST and CIFAR10 datasets. In addition, the step sizes are set to be sufficiently small and fixed for ReLU networks with different widths. It can be observed that over-parameterization indeed speeds up the convergence of gradient descent, which is consistent with Lemmas 5.2 and 5.3, since the squared gradient norm scales with m, which further implies that a wider network leads to a larger function decrease when the step size is fixed. We also display the distance between the iterates of GD and the initialization in Fig. 2. It shows that when the network becomes wider, GD tends to converge to a point closer to the initialization. This suggests that the iterates of GD for training an over-parameterized deep ReLU network are less likely to escape the required perturbation region, and thus can be guaranteed to converge to a global minimum. This corroborates our theory.

Finally, we monitor the activation pattern changes of all hidden neurons during the training process, and show the results in Fig. 3, where we use the cross-entropy loss on both the MNIST and CIFAR10 datasets. Specifically, in each iteration, we compare the activation status of all hidden nodes on all inputs with their status at the initialization, and compute the fraction of nodes whose activation status differs from that at the initialization. From Fig. 3 it is clear that the activation pattern difference ratio decreases dramatically as the neural network becomes wider, which introduces less non-smoothness during the training process. This implies that a wider ReLU network can better guarantee a sufficient function decrease after one gradient descent step, which is consistent with our theory.
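The activation pattern difference ratio reported in Fig. 3 can be computed as in the sketch below, which records the 0/1 activation status of every hidden node on every input and compares it with the status at initialization; the function names are ours and the loop-based implementation favors clarity over speed.

```python
import numpy as np

def relu_patterns(W_list, X):
    """Stack the 0/1 activation patterns of all hidden nodes for all inputs into
    an array of shape (n, m_1 + ... + m_L)."""
    patterns = []
    for x in X:
        h, p = x, []
        for W in W_list:
            pre = W.T @ h
            p.append(pre > 0)
            h = np.maximum(pre, 0.0)
        patterns.append(np.concatenate(p))
    return np.array(patterns)

def pattern_diff_ratio(W_list, W0_list, X):
    """Fraction of (input, hidden node) pairs whose activation status differs
    from that at the initialization W^{(0)}."""
    return float(np.mean(relu_patterns(W_list, X) != relu_patterns(W0_list, X)))
```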

Fig. 2 Distance between the iterates of GD and the initialization. a MNIST dataset. b CIFAR10 dataset

Fig. 3 Activation pattern difference ratio between iterates of GD and the initialization. a MNIST dataset. b CIFAR10 dataset

7 Conclusions and future work

In this paper, we studied training deep neural networks by gradient descent. We proved that gradient descent can achieve a global minimum of the training loss for over-parameterized deep ReLU networks with random initialization, under a milder assumption on the training data. Compared with the state-of-the-art results, our theoretical guarantees are sharper in terms of both the over-parameterization condition and the convergence rate. Our result can also be extended to stochastic gradient descent (SGD) and other loss functions (e.g., square hinge loss and smoothed hinge loss). Such extensions can be found in the longer version of this paper (Zou et al. 2018). In the future, we will further improve the over-parameterization condition such that it is closer to the widths of neural networks used in practice. Our proof technique can also be extended to other neural network architectures including convolutional neural networks (CNNs) (Krizhevsky et al. 2012), residual networks (ResNets) (He et al. 2016) and recurrent neural networks (RNNs) (Hochreiter and Schmidhuber 1997), and can give sharper over-parameterization conditions than existing results for CNNs, ResNets (Du et al. 2018a; Allen-Zhu et al. 2018a) and RNNs (Allen-Zhu et al. 2018b). Moreover, it is also interesting to explore how our optimization guarantees for over-parameterized neural networks can be integrated with existing universal approximation results such as Hornik (1991), Telgarsky (2016), Lin and Jegelka (2018), Zhou (2019).