1 Introduction

In recent years, deep learning [11], also known as deep neural networks (DNNs), has achieved breakthroughs in numerous areas such as speech recognition, text processing, and image processing. A variety of open-source frameworks, including Caffe [7], MXNet [3], and TensorFlow [1], have been released, and many deep learning network models, such as AlexNet [10], GoogLeNet [14], and ResNet-50 [5], have emerged.

Training a good deep learning network model requires a large amount of time and energy (e.g., [10, 14]). If the training process can be parallelized, substantial time and energy savings can be achieved, which makes this a problem worth investigating. Currently, there are two main strategies for parallelizing DNN training: data parallelism (e.g., [3]) and model parallelism (e.g., [1]). Both can be sped up by increasing the number of training workers. This coarse-grained parallelism strategy significantly accelerates the training of network models, but several flaws remain. First, the communication needed to synchronize multiple training workers accounts for a large part of the training time, so linear acceleration is difficult to achieve; in fact, the effective training speed of a single worker decreases rather than increases. Second, GPUs are not fully utilized, especially for smaller datasets, leaving a large amount of GPU computation and memory resources idle. In addition, this coarse-grained strategy is difficult for many researchers who use deep learning systems to scale up, mainly because of the cost of additional GPUs and the technical difficulty of extending the systems.

To address the above problems of the two parallelism strategies and to fully exploit the potential of a single GPU, we propose a new fine-grained layer-wise parallelism strategy, named FiLayer, which includes inter-layer and intra-layer parallelism. Both aim to parallelize the training of network models at the layer level inside a GPU-based training worker. The benefit of FiLayer is twofold. First, instead of adding more hardware resources, such as the additional training workers required by data and model parallelism, it fully exploits the parallelism potential within the network and improves the training speed on a single GPU. Second, it is compatible with other parallelism methods; in other words, FiLayer can be combined with data parallelism and model parallelism with minimal modification to achieve further speedups. CUDA stream technology is used to implement the parallelism strategies. We design and implement FiLayer by extending Caffe and name the FiLayer-based system LP-Caffe. The experimental results show that FiLayer achieves speedups of \(1.58{\times }\)–\(2.19{\times }\) compared with Caffe. The contributions of this study can be summarized as follows.

  • A fine-grained layer-wise concept for DNN parallelism, which is a meaningful extension of data and model parallelism. CUDA streams are applied to realize the concept.

  • An inter-layer parallelism strategy for adjacent layers of deep neural network models. Each mini-batch is split into several fragments so that different fragments can be processed by different adjacent layers in a pipelined manner.

  • An intra-layer parallelism strategy for convolution layers. The strategy divides the operations in one convolution layer into several parts and runs them in parallel.

2 Related Work

Researchers have conducted numerous studies on training neural network models in parallel. Most of them achieve acceleration through coarse-grained data and model parallelism strategies and by adding more acceleration hardware. The following are some representative works.

FireCaffe [6] trains GoogLeNet and Network-in-Network (NIN) [8] on ImageNet [13] on a cluster of 128 GPUs and achieves speedups of \(47{\times }\) and \(39{\times }\), respectively, compared with the original Caffe. Ammar et al. design the S-Caffe system [2], which uses data parallelism to train GoogLeNet on a GPU cluster and achieves a speedup of \(2.5{\times }\) when the number of GPUs is increased from 32 to 160.

The above works share a common trait: they obtain their speedups by leveraging more acceleration hardware, and they have difficulty achieving the ideal linear acceleration with this coarse-grained parallelism strategy. Some studies have turned their attention to a lower parallelism granularity for DNNs. MXNet proposes the idea of dividing an LSTM network model across several GPUs by layer, so that the GPUs can perform the computations of different layers in a pipelined manner. However, layers on the same GPU still cannot be processed in parallel.

cuDNN [4] is a library of optimized computational functions (e.g., convolution, pooling, and sigmoid) for deep learning, which focuses on refining the processing inside each layer of a neural network. According to [4], the majority of functions in cuDNN have a straightforward implementation; however, the convolution implementation based on matrix multiplication is not obvious. Since cuDNN is not open-source, we cannot obtain more details about its low-level implementation. It is also worth noting that cuDNN addresses only intra-layer optimization and does not consider inter-layer optimization, which is exactly what we aim to do in this paper.

Generally, little research has been done on fine-grained layer-wise parallelism for DNNs that considers both the inter-layer and the intra-layer aspects.

3 Inter-layer Parallelism

3.1 Problem Analysis of Mini-batch Gradient Descent

Mini-batch gradient descent (MBGD) is regarded as one of the main optimization algorithms used in deep learning systems; it consumes less memory and converges quickly. However, because of the inherent sequentiality of MBGD caused by the data dependencies between layers, it is difficult to parallelize the computations of multiple layers.
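
As a brief reminder (a standard textbook formulation; the symbols below are not used elsewhere in this paper), one MBGD iteration with a mini-batch \(B\) of size \(b\) updates the parameters \(\theta \) as \(\theta \leftarrow \theta - \frac{\eta }{b}\sum _{x\in B}\nabla _{\theta }\ell (x;\theta )\), where \(\eta \) is the learning rate and \(\ell (x;\theta )\) is the loss on sample \(x\). Computing \(\nabla _{\theta }\ell \) requires a complete forward pass through layers \(L_1,\ldots ,L_N\) followed by a complete backward pass in the reverse order, and it is exactly this chain of dependencies that serializes the per-layer computations.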

3.2 Data Pipeline Algorithm

Inspired by the concept of instruction pipelining, we propose a new algorithm for the processing across multiple layers of neural network models: the Data Pipeline Algorithm (DPA). The aim of the DPA is to overcome the limitation imposed by the inherent order of MBGD and to enable the computations of different layers to be executed in parallel, so that more resources can be devoted to the training process and the training of models is sped up. The main ideas of the algorithm are described in the remainder of this section; see Fig. 1 for a depiction of the DPA and Algorithm 1 for its detailed procedure.

Fig. 1. Depiction of the Data Pipeline Algorithm. \(L_i\) denotes a part of a model, \(T_i\) denotes a thread, \(Q_i\) denotes a message queue, \(B_i\) and \(B_j\) denote fragments, and \(S_i\) denotes a CUDA stream.

Algorithm 1. Data Pipeline Algorithm (DPA)

First, we divide a neural network model with N layers into N parts by layer (Line 2 in Algorithm 1). Each part consists of one layer and is controlled by one CPU thread, which maintains a message queue to exchange messages with the other threads. While the algorithm runs, each thread monitors its own message queue and performs different computation operations according to the messages it receives (Lines 11–18). These messages include the forward propagation message (FM), the backward propagation message (BM), and the exit message (EM). Figure 1 shows a schematic diagram of the algorithm.
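
As an illustration, a minimal C++ sketch of such a blocking message queue is given below. The types MsgType, LayerMsg, and MsgQueue are hypothetical helpers introduced here for illustration, not the actual LP-Caffe data structures.

// A blocking message queue; each layer thread T_i owns one queue Q_i.
// (Hypothetical helper types for illustration only.)
#include <condition_variable>
#include <mutex>
#include <queue>

enum class MsgType { FM, BM, EM };              // forward / backward / exit
struct LayerMsg { MsgType type; int fragment; };

class MsgQueue {
 public:
  void Push(const LayerMsg& m) {
    { std::lock_guard<std::mutex> lk(mu_); q_.push(m); }
    cv_.notify_one();
  }
  LayerMsg Pop() {                              // blocks until a message arrives
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [this] { return !q_.empty(); });
    LayerMsg m = q_.front();
    q_.pop();
    return m;
  }
 private:
  std::queue<LayerMsg> q_;
  std::mutex mu_;
  std::condition_variable cv_;
};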

Second, we split each mini-batch into F fragments to reduce the data dependency between layers (Line 4). Each fragment has the same size, and the granularity of the data dependency is reduced from a mini-batch to a fragment. At any given moment, each fragment is processed by only one layer; however, different layers can process different fragments concurrently. The former ensures the correctness of the network model training, whereas the latter reduces the training time of the network model. Specifically, in the i\(^{th}\) iteration, the first task is to divide the mini-batch into F fragments: \({B_{i,1},B_{i,2},...,B_{i,F}}\). Next, during the forward propagation of the network model, as soon as \({L_i}\) finishes computing \({B_{i,f}}\), it informs \({L_{i+1}}\) so that the latter can process the fragment \({B_{i,f}}\) at once. Simultaneously, \({L_i}\) starts to compute \({B_{i,f+1}}\), allowing \({L_i}\) and \({L_{i+1}}\) to compute different fragments in parallel. The backward propagation of the network model proceeds in the same pipelined style, but in the reverse direction.
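
Building on the MsgQueue sketch above, the host side of one iteration can be sketched as follows. This is hypothetical driver code: we assume, as one possible design, that the driver only injects the F forward messages into the first layer's queue and that the last layer converts each finished forward fragment into a backward message travelling in the reverse direction.

// Hypothetical host-side driver for one DPA iteration: push one FM per
// fragment into the first layer's queue; the pipeline then fills by itself.
void RunIteration(MsgQueue* queues, int F) {
  for (int f = 0; f < F; ++f) {
    // As soon as the first layer finishes fragment f, it forwards the FM to
    // the next layer and immediately starts fragment f + 1.
    queues[0].Push({MsgType::FM, f});
  }
  // Backward propagation flows in the reverse direction: in this sketch the
  // last layer is assumed to emit a BM for each fragment it has finished.
}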

Third, we create N CUDA streams, one for each thread, and issue different operations into different CUDA streams so that they can be executed in parallel (Lines 15–16). A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, can even run concurrently. Because the computation operations in different streams execute asynchronously, the CUDA API cudaStreamSynchronize is called to synchronize \({S_i}\), ensuring the logical correctness of the algorithm after \({L_i}\) finishes its computation in \({S_i}\) and before it notifies the next layer \({L_{i+1}}\) to continue.
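
The per-layer message loop (Lines 11–18 of Algorithm 1) can then be sketched as follows. Forward_gpu and Backward_gpu stand for hypothetical wrappers that asynchronously launch the layer's kernels on the given stream; this is an illustrative sketch rather than the actual LP-Caffe code.

#include <cuda_runtime.h>

// Hypothetical wrappers that enqueue layer i's kernels for one fragment
// on the given CUDA stream (asynchronous with respect to the host).
void Forward_gpu(int layer, int fragment, cudaStream_t s);
void Backward_gpu(int layer, int fragment, cudaStream_t s);

void LayerWorker(int i, int N, MsgQueue* queues, cudaStream_t* streams) {
  while (true) {
    LayerMsg msg = queues[i].Pop();
    if (msg.type == MsgType::EM) break;            // exit message: stop thread
    if (msg.type == MsgType::FM) {
      Forward_gpu(i, msg.fragment, streams[i]);    // issue work into S_i
      cudaStreamSynchronize(streams[i]);           // wait for S_i only
      if (i + 1 < N) queues[i + 1].Push(msg);      // hand the fragment to L_{i+1}
    } else {                                       // MsgType::BM
      Backward_gpu(i, msg.fragment, streams[i]);
      cudaStreamSynchronize(streams[i]);
      if (i > 0) queues[i - 1].Push(msg);          // hand the fragment to L_{i-1}
    }
  }
}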

Based on the above ideas, the DPA can be performed on a GPU in parallel.

4 Intra-layer Parallelism

The DPA only realizes the inter-layer parallelism of a DNN. However, inside certain layers there is still considerable parallelism potential to be exploited. The convolution layer, which is implemented on top of matrix multiplication, is one such layer. In this section, we present a fine-grained intra-layer parallelism strategy that parallelizes the processing of the convolution layer.

4.1 Analysis of Convolution Operation

Figure 2(a) shows the forward propagation of a convolution layer in Caffe, where the mini-batch size of the input data is six. Because all the input images are submitted to the default CUDA stream, the algorithm is ultimately executed in a completely serialized manner over all the input data. A similar problem exists in the backward propagation of the convolution layer. Obviously, this type of implementation wastes the computational resources of the GPU, even though the GPU supports massive parallel computation, and the training time of the entire network model inevitably increases. In the following subsection, we show how to optimize this algorithm by parallelizing the convolution operations with CUDA streams.
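
For contrast, the serial loop of Fig. 2(a) has roughly the following shape, where im2col_and_gemm is a hypothetical helper standing in for Caffe's per-image im2col and GEMM calls; because no stream is specified, every launch goes into the single default CUDA stream.

// Simplified sketch of the serial per-image loop in Fig. 2(a).
void im2col_and_gemm(const float* image, float* output);  // hypothetical helper

void ConvForwardSerial(const float* bottom, float* top,
                       int batch, int bottom_dim, int top_dim) {
  for (int n = 0; n < batch; ++n) {
    // All work is queued into the default CUDA stream, so image n + 1
    // cannot start before image n has finished.
    im2col_and_gemm(bottom + n * bottom_dim, top + n * top_dim);
  }
}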

4.2 Parallelization of Convolution Layer

Figure 2(b) shows the parallelization of the forward propagation of the convolution layer under an ideal situation, where we assume that operations in different streams can be performed concurrently and that the processing time of each image is equal. However, in practice, operations in different streams cannot run fully concurrently, and the computation times of the images may be unequal. Figure 2(c) shows such a situation, and Algorithm 2 presents more details.

Fig. 2. The forward propagation of a convolution layer. For convenience of discussion, the batch size of the input data is set to 6, and the number of CUDA streams used in (b) and (c) is set to 3. \(I_i\) denotes the forward propagation of the \(i^{th}\) image of one mini-batch, and \(S_s\) denotes the \(s^{th}\) CUDA stream created by the user.

Algorithm 2. Parallel forward propagation of the convolution layer

The main idea of Algorithm 2 is to assign different images to different CUDA streams and to process these images in parallel by exploiting the concurrency of the streams. Specifically, we first obtain the S CUDA streams created in advance (Line 2 in Algorithm 2). Second, during the forward propagation, we calculate the offset of the \(n^{th}\) image in the top data and the offset of the \(n^{th}\) image in the bottom data (Lines 4–6). Then, we convert the \(n^{th}\) image into a matrix and submit its convolution operation to the \(s^{th}\) stream (Lines 7–10). To ensure the correctness of the process, two additional steps are required. First, to keep the data of different streams independent, we allocate a separate buffer (bf), used to convert an image into a matrix, for each stream. Second, because multiple streams run concurrently (Fig. 2(b)), we must synchronize all the streams before starting the next operation (Lines 11–12). In the backward propagation of the convolution layer, we take a similar strategy to parallelize the convolution operations; due to space constraints, it is not shown here. A sketch of the parallelized forward pass is given below.
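
The following sketch illustrates this strategy with the CUDA runtime and cuBLAS. Here im2col_gpu_stream is a hypothetical stream-aware variant of Caffe's im2col_gpu, and the bias term and exact GEMM shapes are simplified; it shows the technique rather than the actual LP-Caffe code.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical stream-aware variant of Caffe's im2col_gpu: expands one image
// into a column buffer, launching its kernel on the given stream.
void im2col_gpu_stream(const float* image, float* col, cudaStream_t s);

// Sketch of Algorithm 2 (forward pass). Image n is mapped to stream s = n % S,
// with one im2col buffer bf[s] per stream to keep the streams' data independent.
void ConvForwardParallel(cublasHandle_t handle, const float* bottom, float* top,
                         const float* weights, float** bf,
                         cudaStream_t* streams, int S,
                         int batch, int bottom_dim, int top_dim,
                         int M, int N, int K) {      // top_n(MxN) = W(MxK) * col(KxN)
  const float alpha = 1.0f, beta = 0.0f;
  for (int n = 0; n < batch; ++n) {
    int s = n % S;
    const float* bottom_n = bottom + n * bottom_dim; // offset of image n (Lines 4-6)
    float* top_n = top + n * top_dim;
    im2col_gpu_stream(bottom_n, bf[s], streams[s]);  // convert image n to a matrix
    cublasSetStream(handle, streams[s]);             // this GEMM goes into stream s
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,    // column-major call arranged
                N, M, K, &alpha, bf[s], N,           // for row-major Caffe blobs
                weights, K, &beta, top_n, N);
  }
  for (int s = 0; s < S; ++s)                        // Lines 11-12: wait for all
    cudaStreamSynchronize(streams[s]);               // streams before moving on
}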

5 Experimental Results

5.1 Datasets and Environments

We use four typical image classification datasets and three different hardware environments for the performance evaluation. The datasets include MNIST [12], CIFAR10 [9], CIFAR100 [9], and ImageNet [13]. The specific experimental environment is shown in Table 1.

Table 1. Experimental environment

5.2 Evaluating Inter-layer Parallelism

In this subsection, we evaluate the inter-layer parallelism strategy. Specifically, for different datasets, we analyze the effect of different values of F on the convergence speed of the network model in different hardware environments. We choose the result of the original Caffe as the benchmark and take the experimental results of CIFAR10 trained on M1 as an example. More detailed experimental results are given in Table 2.

Table 2. The experimental results of the inter-layer parallelism

From Table 2, we notice that different F values have different effects on the convergence speed of the network model. When F is 1, which means that the mini-batch is not split into fragments, the model training time is greater than that of the benchmark because of the additional scheduling overhead introduced by the DPA. When F is 6, the network model achieves the highest convergence speed, with a speedup SP of 1.51. The convergence speed begins to decrease when F increases further; when F reaches 10, the convergence speed becomes lower than that with \(F=6\), and the speedup SP drops to 1.44. The following reasons account for this result. First, GPUs are better suited to processing larger batches; when a mini-batch is divided into F fragments, each mini-batch must be processed in F separate passes, so the total time for the GPU to compute the F fragments exceeds the time for the whole mini-batch. Second, whether the operations can actually be executed in parallel also depends on the amount of computational resources on the GPU. Once F reaches a certain threshold, all the computational resources of the GPU are allocated, and increasing F further cannot reduce the training time. On the contrary, because of the increased overhead at fragment boundaries, the training time deteriorates as F increases further.

The experimental results for the other datasets are also given in Table 2. Due to limited space, the experimental data for M2 and M3 are not shown in detail here; we can draw similar conclusions from the experimental results on M2 and M3.

5.3 Evaluating Intra-layer Parallelism

In this subsection, we evaluate the performance of the proposed intra-layer parallelism strategy on top of the inter-layer parallelism. Specifically, with F set to its optimal value, we analyze the effect of different values of S on the convergence speed of the network model in different hardware environments. A detailed overview of the experimental results is given in Table 3. Here, we again take the experimental results of CIFAR10 trained on M1 as an example for detailed explanation.

Table 3. The experimental results of the intra-layer parallelism

From Table 3, we can see that the convergence speed of the network model first increases and then decreases with increasing S. When S is 6, the speedup SP reaches its maximum value of 2.19; with further increases in S, the value of SP begins to decrease. The major reason is that the GPU cannot support too many concurrent CUDA streams because of its limited resources. A second factor is the amount of GPU resources (memory, registers, and blocks) assigned to a single stream: once S reaches a certain threshold, the GPU resources are completely consumed, and even if S is increased further, the processing time cannot be reduced. These two factors show that more CUDA streams do not necessarily yield higher speedups. We can draw similar conclusions from the experimental results on M2 and M3.

6 Conclusions and Future Work

In this work, we propose FiLayer, a fine-grained layer-wise parallelism strategy for deep neural networks that includes inter-layer parallelism and intra-layer parallelism. FiLayer is implemented by extending Caffe, and we call the FiLayer-based system LP-Caffe. To realize inter-layer parallelism, we propose a fine-grained pipeline algorithm, the DPA, which allows several adjacent layers of a network model to be processed in a pipelined manner. For intra-layer parallelism, we focus on the convolution process. CUDA stream technology is applied to realize both fine-grained parallelism strategies. However, we cannot yet deploy FiLayer on top of cuDNN because of a conflict between the CUDA stream mechanism of cuDNN and that of FiLayer's inter-layer parallelism strategy; since cuDNN is not open-source, we cannot resolve this conflict yet. Therefore, in our experiments, we choose the original Caffe as the benchmark. The experimental results indicate that the proposed FiLayer-based LP-Caffe achieves \(1.58{\times }\)–\(2.19{\times }\) speedups compared with the benchmark. In the future, we will focus on making FiLayer and cuDNN work together, as well as extending FiLayer to multiple GPUs and multiple training workers.