13.1 Introduction

From the emergence of pre-trained models (PTMs) represented by BERT [7] in 2018 to the subsequent release of GPT-3 [4] with 175 billion parameters in 2020, PTMs have attracted increasing attention from academia and industry. As introduced in Chap. 5, these PTMs perform well on various downstream tasks, and their performance can be further improved by increasing the parameter size. From BERT and GPT-3 to the recently proposed big models [11, 18, 39, 40], the parameter sizes of these models have gradually grown from hundreds of millions to trillions. By 2022, the record for the largest parameter size had been raised to over 1 trillion [11].

Although increasing model parameter size brings better model performance, enabling most institutions and individuals to enjoy the power of big models still faces several challenges:

Computation Barrier

Bigger models inevitably require higher computing costs in their training, tuning, and inference. For example, as shown in the original paper of GPT-3, more than 1,000 high-performance GPUs were used to train GPT-3. Most small- and medium-sized institutes cannot afford such a colossal computing cluster. Moreover, even after the training process is complete, deploying these big models for a specific application can still cost thousands of dollars at a time for model inference, indicating that using big models is expensive in addition to training them.

Expertise Barrier

Many deep learning techniques that work well on small models become inefficient or inapplicable on big models. For example, full-parameter fine-tuning has been widely used to tune PTMs for specific downstream tasks. If full-parameter fine-tuning is performed to adapt a big model for each downstream task, we have to spend terabytes of disk and memory space to store all the task-specific fine-tuned models. Moreover, as parameter size grows to billions, some novel characteristics specific to big models emerge, such as in-context learning [26], chain-of-thought reasoning [36], and the lottery ticket hypothesis [13]. Hence, novel techniques such as prompt learning [24], delta tuning [9], and MoEfication [42] have been developed specifically for big models. These techniques form an expertise barrier for researchers and practitioners who are interested in big models but lack extensive experience.

To address these barriers, we introduce OpenBMB, an open-source suite for building big model systems more efficiently, with the goal of making big models available to everyone. OpenBMB contains several toolkits for different application scenarios of big models, including BMTrain for efficient training (in Sect. 13.2), OpenPrompt and OpenDelta for efficient tuning (in Sect. 13.3), BMCook for efficient compression (in Sect. 13.4), and BMInf for efficient inference (in Sect. 13.5). Figure 13.1 shows how these toolkits work together to form an effective and efficient big model system. In this chapter, we introduce these toolkits in detail, covering the key technologies behind them and how to run them in code. More details of advanced big model techniques, such as prompt learning and delta tuning, can be found in Chap. 5.

Fig. 13.1 An overview of the whole OpenBMB architecture. (The figure comes from the website of OpenBMB https://www.openbmb.cn)

13.2 BMTrain: Efficient Training Toolkit for Big Models

The success of big models is due to large-scale data and huge parameter sizes. The large-scale data enables big models to acquire versatile knowledge during the pre-training stage, while the huge parameter size enables big models to store the acquired knowledge well. However, utilizing large-scale data and huge parameter sizes is a double-edged sword, simultaneously bringing significant performance enhancement and a severe computation barrier. The key to overcoming the computation barrier of big model training is to adopt more efficient distributed learning strategies that take full advantage of distributed computing clusters. Next, we will introduce the distributed learning strategies used in BMTrain and how to use BMTrain to set these strategies to train big models efficiently.

13.2.1 Data Parallelism

Intuitively, training a big model on a GPU cluster is similar to organizing an event with many partners. Imagine that you have a lot of tasks to do and you want to get them done as quickly as possible; a good solution is to distribute these tasks equally to your partners so that everyone is busy and no one is idle. This is the principle of data parallelism. As shown in Fig. 13.2, the training data is divided into several parts, and each GPU is responsible for training a part of the data. For the model parameters, each GPU stores the whole model. In each iteration, each GPU calculates the gradient for its own part. After all the gradients are computed, the gradients of different GPUs are averaged together as the overall gradient. The overall gradient is then passed to the optimizer to update the model parameters. Finally, the updated parameters are sent back to each GPU for the next iteration. This process is repeated until the training is finished.

Fig. 13.2 Data parallelism partitions the input data evenly to each GPU before the forward propagation and aggregates the gradients from all GPUs to update the model parameters after the backward propagation

Data parallelism is largely similar to training a model on a single GPU, except that data parallelism adds two additional stages of gradient aggregation and parameter synchronization. Since the training data is evenly partitioned to each GPU, each GPU performs the same amount of computations. As a result, all GPUs can work concurrently to achieve the highest utilization, rather than having some nodes idle while others are under high load.
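To make this workflow concrete, the following is a minimal, generic PyTorch sketch of data parallelism using DistributedDataParallel. It is not BMTrain-specific, and the model class MyModel and the dataset object are hypothetical placeholders for the user's own model and data.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# One process per GPU; the process group handles gradient communication.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = DDP(MyModel().cuda(), device_ids=[local_rank])  # full copy of parameters on every GPU
sampler = DistributedSampler(dataset)                   # each GPU reads a disjoint shard of the data
loader = DataLoader(dataset, sampler=sampler, batch_size=32)
optimizer = torch.optim.Adam(model.parameters())

for x, y in loader:
    loss = torch.nn.functional.cross_entropy(model(x.cuda()), y.cuda())
    loss.backward()        # gradients are averaged across all GPUs during backward
    optimizer.step()       # every GPU applies the same averaged gradient
    optimizer.zero_grad()
```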

Data parallelism can effectively utilize large-scale computing clusters, but its limitations are gradually exposed as the model size increases. Since data parallelism requires keeping the complete model parameters on each GPU, this method becomes infeasible when the model parameters no longer fit in the memory of a single GPU. For example, for a model with 10 billion parameters, more than 40 GB of memory is needed to store the model parameters and gradients, which is far beyond the capacity of most GPUs. Besides, GPUs with a capacity of 40 GB are expensive and cost-inefficient. This parallel strategy needs to be further improved to enable big models to be trained with limited resources.

13.2.2 ZeRO Optimization

ZeRO (zero redundancy optimizer) [29] is a strategy that allows efficient training of models whose parameter size far exceeds the capacity of a single GPU. As mentioned in the data parallelism section, the complete model parameters, gradients, and optimizer states of a big model cannot be stored on a single GPU. As shown in Fig. 13.3, the ZeRO algorithm further optimizes data parallelism by partitioning the model parameters, gradients, and optimizer states evenly across GPUs and only temporarily aggregating them when needed. The model parameters are aggregated twice, once during the forward propagation and once during the backward propagation, and the gradients are aggregated once when updating the optimizer states. Therefore, ZeRO needs to communicate three times in each iteration, one more time than data parallelism.

Fig. 13.3 ZeRO partitions the model parameters, gradients, and optimizer states evenly to each GPU and only gathers them together temporarily when they are needed for the computation

Although it brings additional communication, the ZeRO algorithm can train big models on a GPU cluster with limited capacity. For example, with ZeRO, it is possible to train a model with 175 billion parameters on 64 A100 GPUs, while data parallelism on the same cluster can only train a model with no more than 3 billion parameters. Empirically, the number of model parameters that can be trained with the ZeRO algorithm grows with the number of GPUs, since each GPU only needs to hold a fraction of the model states.

13.2.3 Quickstart of BMTrain

BMTrain is an open-source toolkit for big model training. It provides easy-to-use interfaces based on PyTorch to help users accelerate the training process using the data parallelism and ZeRO algorithms. For commonly used model architectures such as Transformers, BMTrain implements a series of customized optimizations and makes writing distributed training code just like writing single-node training code. With BMTrain, only four steps are required to modify a single-node training code to drive a distributed cluster for training acceleration.

Step 1: Initializing BMTrain

Since BMTrain is built on PyTorch, just as PyTorch calls the function init_process_group at the beginning of the code to initialize distributed training, BMTrain requires calling init_distributed at the beginning of the code to initialize BMTrain (Fig. 13.4). The function init_distributed also allows users to set random seeds. By setting random seeds, BMTrain provides a reliable and reproducible mechanism to control all random processes, ensuring that multiple runs of the same code yield stable results.

Fig. 13.4 An example of initializing BMTrain
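The initialization in Fig. 13.4 roughly corresponds to the following sketch; the zero_level argument appears in the figure, but its exact name and default may vary across BMTrain versions.

```python
import bmtrain as bmt

# Initialize BMTrain at the very beginning of the training script,
# analogous to torch.distributed.init_process_group in plain PyTorch.
bmt.init_distributed(
    seed=0,        # fix all random seeds for reproducible results
    zero_level=3,  # ZeRO optimization level (as shown in Fig. 13.4)
)
```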

Step 2: Enabling ZeRO Optimization

After initializing BMTrain, the data parallel and ZeRO algorithms can be applied by making some simple modifications to the single-node training code. As shown in Fig. 13.5, users should modify the model construction code to replace Module and Parameter in PyTorch with DistributedModule and DistributedParameter implemented in BMTrain. To further alleviate memory overhead, wrapping some model layers with CheckpointBlock enables the activation checkpointing mechanism, details of which can be found in the work [6].

Fig. 13.5 An example of enabling the data parallel and ZeRO algorithms. (a) An example of the single-node training code built on PyTorch. (b) An example of modifying the single-node training code to the BMTrain version
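A sketch of the BMTrain version in Fig. 13.5b might look as follows; SomeTransformerBlock is a hypothetical placeholder for an ordinary PyTorch layer.

```python
import torch
import bmtrain as bmt

class MyModel(bmt.DistributedModule):  # replaces torch.nn.Module
    def __init__(self):
        super().__init__()
        # Replaces torch.nn.Parameter; the parameter is partitioned across GPUs.
        self.param = bmt.DistributedParameter(torch.empty(1024))
        # Wrapping layers with CheckpointBlock enables activation checkpointing.
        self.module_list = torch.nn.ModuleList([
            bmt.CheckpointBlock(SomeTransformerBlock()),
            bmt.CheckpointBlock(SomeTransformerBlock()),
        ])
```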

Step 3: Communication Optimization (Optional)

For those model architectures with multiple layers, such as Transformers, BMTrain provides a communication optimization module to reduce the time overhead by overlapping communication and computation. As shown in Fig. 13.6, to enable this optimization, users need to replace ModuleList in PyTorch with TransformerBlockList implemented in BMTrain.

Fig. 13.6 An example of enabling the communication optimization algorithm. (a) An example of using ModuleList to define a multi-layer model in PyTorch. (b) An example of using TransformerBlockList to replace ModuleList
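A sketch of the replacement in Fig. 13.6b, under the same assumptions as the previous example (SomeTransformerBlock is a placeholder):

```python
import torch
import bmtrain as bmt

class MyModel(bmt.DistributedModule):
    def __init__(self):
        super().__init__()
        self.param = bmt.DistributedParameter(torch.empty(1024))
        # TransformerBlockList replaces torch.nn.ModuleList and lets BMTrain
        # overlap the communication of one layer with the computation of another.
        self.module_list = bmt.TransformerBlockList([
            bmt.CheckpointBlock(SomeTransformerBlock()),
            bmt.CheckpointBlock(SomeTransformerBlock()),
        ])

    def forward(self, x):
        # The block list is called as a whole instead of iterating layer by layer.
        return self.module_list(x)
```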

Step 4: Launching the Distributed Training

The code optimized by BMTrain can be run using the same launcher as the PyTorch distributed module. Figure 13.7 shows the distributed training command, which depends on the specific PyTorch version, and this command should be executed on all nodes in the cluster at the same time. In the launching command, users should specify the communication configuration, including the IP address and port number of the master node.

Fig. 13.7 An example of launching the training code train.py in BMTrain
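The launching commands in Fig. 13.7 roughly take the following form; the angle-bracket values are placeholders that depend on the cluster, and the exact flags depend on the PyTorch version.

```bash
# For older PyTorch versions (< 1.10)
python3 -m torch.distributed.launch --nnodes <num_nodes> --node_rank <rank> \
    --nproc_per_node <gpus_per_node> \
    --master_addr <master_ip> --master_port <port> train.py

# For newer PyTorch versions (>= 1.10)
torchrun --nnodes <num_nodes> --nproc_per_node <gpus_per_node> \
    --rdzv_id=1 --rdzv_backend=c10d \
    --rdzv_endpoint=<master_ip>:<port> train.py
```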

13.3 OpenPrompt and OpenDelta: Efficient Tuning Toolkit for Big Models

In Chap. 5, we have introduced the general capabilities of big models and their excellent performance on a wide range of downstream tasks. In this section, we focus on showing the critical role that prompt learning and delta tuning play in applications and how to adapt big models to downstream tasks using OpenPrompt and OpenDelta. More details of prompt learning and delta tuning can be found in Chap. 5.

13.3.1 Serving Multiple Tasks with a Unified Big Model

Tuning is an essential part of the practical application of big models, as it adapts the generic capabilities of big models to specific tasks. Due to big models' high capability and generality, a single big model can be adapted to dozens or hundreds of different downstream tasks. Before prompt learning and delta tuning, full-parameter fine-tuning was the mainstream method to adapt big models, where a complete model was fine-tuned and stored for each downstream task. As the parameter size of models increases, the storage overhead of full-parameter fine-tuning becomes increasingly heavy. Moreover, as the number of downstream tasks continues to grow, so does the memory required to deploy these models. These issues seriously increase the cost of using and deploying big models.

Compared with full-parameter fine-tuning, prompt learning and delta tuning are more friendly to big models, because these two tuning approaches can adapt big models to different downstream tasks by changing only a small number of parameters, or even without changing model parameters at all, as shown in Fig. 13.8. This feature solves the storage problem caused by full-parameter fine-tuning and further reduces the heavy resource consumption of deploying multiple models for different tasks. In addition, both prompt learning and delta tuning usually design task instructions in the form of natural language to stimulate the knowledge of big models to solve specific problems. Compared with using a programming language to control big models, using task instructions is more human-friendly.

Fig. 13.8 Big models are fixed on the cluster. Different delta objects are dynamically loaded for different tasks with almost no overhead per task

Based on prompt learning and delta tuning, we can deploy a big model on a GPU cluster and dynamically load task-specific task instructions or delta objects rather than task-specific models. These task instructions and delta objects cooperate with the big model to handle multiple downstream tasks. In this way, the resources consumed during deployment do not grow with the number of downstream tasks. Aggregating all tasks onto the same cluster also brings higher utilization, avoiding the situation where part of the cluster is idle due to unbalanced task requests.

13.3.2 Quickstart of OpenPrompt

OpenPrompt is an open-source prompt learning framework with high extensibility. OpenPrompt supports a variety of mainstream prompt learning methods [21, 30] and can help users more easily apply prompt learning on existing models or develop new models. As shown in Fig. 13.9, a PromptModel consists of a PTM as the backbone, one or more Templates to wrap the raw input with task instructions, and one or more Verbalizers to map the task labels to the vocabulary of PTMs. Developing a prompt learning pipeline using OpenPrompt requires the following steps to build a PromptModel.

Fig. 13.9 The framework of OpenPrompt. (The figure is re-drawn according to Fig. 1 from the OpenPrompt paper [8])

Step 1: Defining the Task

The first step in using OpenPrompt is to define the details of the task. Taking sentiment classification as an example, which aims to judge whether an input sentence is positive or negative, Fig. 13.10 shows how to define the dataset used for prompt learning on this task.

Fig. 13.10 An example of defining the task for prompt learning in OpenPrompt
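A sketch of such a task definition, following the classes and examples suggested by Fig. 13.10 (the two example sentences are illustrative placeholders):

```python
from openprompt.data_utils import InputExample

# There are two classes in sentiment classification: negative and positive.
classes = ["negative", "positive"]

# For simplicity, only two unlabeled examples are defined; text_a holds the input sentence.
dataset = [
    InputExample(guid=0, text_a="Albert Einstein was one of the greatest intellects of his time."),
    InputExample(guid=1, text_a="The film was badly made."),
]
```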

Step 2: Loading the PTM

A PTM is the backbone of a PromptModel. OpenPrompt supports directly loading models from online model hubs, such as ModelCenter built on BMTrain and Hugging Face Transformers built on PyTorch (Fig. 13.11).

Fig. 13.11 An example of loading the PTM for prompt learning in OpenPrompt
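A sketch of loading a PTM with OpenPrompt's load_plm helper, here using BERT as in Fig. 13.11:

```python
from openprompt.plms import load_plm

# Here we use BERT as an example; load_plm returns the model, its tokenizer,
# its configuration, and the tokenizer wrapper class used by PromptDataLoader.
plm, tokenizer, model_config, WrapperClass = load_plm("bert", "bert-base-cased")
```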

Step 3: Defining the Task Instruction

At least one task instruction is required to build a PromptModel. The Template is used to modify the original input with the defined task instruction. Figure 13.12 shows how to join the task instruction "It was" with the field text_a of the dataset defined in Step 1.

Fig. 13.12 An example of defining the task instruction for prompt learning in OpenPrompt
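A sketch of such a template, using OpenPrompt's manual template syntax:

```python
from openprompt.prompts import ManualTemplate

# Wrap the raw input with the instruction "It was"; {"mask"} marks the position
# whose predicted word will be mapped to a label by the verbalizer.
promptTemplate = ManualTemplate(
    text='{"placeholder":"text_a"} It was {"mask"}',
    tokenizer=tokenizer,
)
```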

Step 4: Defining the Verbalizer

At least one verbalizer is required for a PromptModel. The verbalizer is an important part of prompt learning that maps the output words to the task labels. Figure 13.13 maps the word "bad" to the negative label and the words "good", "wonderful", and "great" to the positive label.

Fig. 13.13 An example of defining the verbalizer for prompt learning in OpenPrompt
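A sketch of such a verbalizer with OpenPrompt's ManualVerbalizer:

```python
from openprompt.prompts import ManualVerbalizer

# Map "bad" to the negative label, and "good", "wonderful", "great" to the positive label.
promptVerbalizer = ManualVerbalizer(
    classes=classes,
    label_words={
        "negative": ["bad"],
        "positive": ["good", "wonderful", "great"],
    },
    tokenizer=tokenizer,
)
```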

Step 5: Combining Different Modules into a PromptModel

Based on the PTM, task instruction, and verbalizer obtained earlier, we can define a PromptModel using the code in Fig. 13.14.

Fig. 13.14 An example of building the PromptModel for prompt learning in OpenPrompt
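A sketch of combining the three components into a classification PromptModel:

```python
from openprompt import PromptForClassification

# Combine the backbone PTM, the template, and the verbalizer into one model.
promptModel = PromptForClassification(
    template=promptTemplate,
    plm=plm,
    verbalizer=promptVerbalizer,
)
```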

Step 6: Defining the PromptDataLoader

In order to learn the defined prompt, we also need a dataloader to sample data from the dataset. The PromptDataLoader in Fig. 13.15 is an extension of the PyTorch dataloader used to sample data for prompt learning.

Fig. 13.15 The example of building the dataloader for prompt learning in OpenPrompt
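A sketch of building the dataloader, reusing the objects defined in the previous steps:

```python
from openprompt import PromptDataLoader

# Wrap each raw example with the template and tokenize it for the backbone PTM.
data_loader = PromptDataLoader(
    dataset=dataset,
    tokenizer=tokenizer,
    template=promptTemplate,
    tokenizer_wrapper_class=WrapperClass,
)
```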

Step 7: Training and Inference

After defining the PromptModel and PromptDataLoader, we can train the prompt and perform inference with it. All the training and inference code can be implemented with PyTorch. Figure 13.16 shows an inference example based on OpenPrompt.

Fig. 13.16 An example of model inference based on prompt learning in OpenPrompt
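A sketch of zero-shot inference with the PromptModel defined above:

```python
import torch

# Predict a class index (0 = negative, 1 = positive) for each example.
promptModel.eval()
with torch.no_grad():
    for batch in data_loader:
        logits = promptModel(batch)
        preds = torch.argmax(logits, dim=-1)
        print(classes[preds])
```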

13.3.3 QuickStart of OpenDelta

OpenDelta is an open-source delta tuning toolkit that can perform delta tuning without modifying the code of the backbone PTM. By using OpenDelta, users can easily implement prefix tuning [23], adapter tuning [20], LoRA [17], or any other type of delta tuning. OpenDelta also supports sharing delta objects, so that users can load delta objects trained by others or save and publish their own. Adapting PTMs using OpenDelta takes only the following three steps.

Step 1: Loading the PTM

Similar to OpenPrompt, OpenDelta requires a PTM to be loaded first as the backbone for subsequent tuning. The code in Fig. 13.17 shows how to load BART [22].

Fig. 13.17 The example of loading the PTM for delta tuning in OpenDelta
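A sketch of loading BART from Hugging Face Transformers as the backbone (the checkpoint name is an illustrative choice):

```python
from transformers import BartForConditionalGeneration

# Load BART as the backbone model.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
```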

Step 2: Adding the Delta Object

After loading the PTM, OpenDelta requires users to specify the parameters to be tuned (i.e., the delta object). The code in Fig. 13.18 shows how to use the built-in modifications (i.e., adapter layers) in OpenDelta to specify the parameters that need to be tuned.

Fig. 13.18 The example of specifying the delta object for delta tuning in OpenDelta
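A sketch of attaching adapter layers with OpenDelta; the modified_modules argument names the sub-modules to attach adapters to and is given here only as an illustrative assumption.

```python
from opendelta import AdapterModel

# This will apply adapter tuning: adapter layers are inserted after the
# specified sub-modules of the backbone.
delta_model = AdapterModel(backbone_model=model, modified_modules=["fc2"])
```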

Step 3: Freezing the Backbone

After adding the delta object, the parameters of the backbone model are not automatically frozen. Users need to use the freeze_module method provided by OpenDelta to freeze the backbone model (Fig. 13.19).

Fig. 13.19 The example of freezing the backbone for delta tuning in OpenDelta
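A sketch of freezing the backbone so that only the delta object remains trainable:

```python
# Freeze the parameters of the backbone; only the delta (adapter) parameters stay trainable.
delta_model.freeze_module(exclude=["deltas"], set_state_dict=True)
delta_model.log()  # optionally inspect which parameters are trainable
```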

After completing the above three steps, the model can be trained with an ordinary training script. After training is completed, users can use the state_dict interface to obtain the delta object's parameters, just as they would obtain parameters in PyTorch. The delta object obtained with OpenDelta usually requires very little storage space, so it can easily be saved and shared with others.

13.4 BMCook: Efficient Compression Toolkit for Big Models

The research on model compression started long ago, and in recent years several vital directions, such as quantization [2, 3, 31, 32], distillation [16, 19, 28, 33], and pruning [5, 10, 14, 35, 37, 38, 43], have been widely explored. Before the emergence of big models, model compression techniques were mainly applied to adapt models to various low-resource end devices, such as cell phones and cameras, or to real-time applications that require low-latency inference. After big models became popular, their inference requires more expensive high-end devices than conventional deep models do, making compression even more critical.

In order to make big models run on common devices such as consumer GPUs, we have to combine different compression techniques to minimize the parameter size of these models without degrading the model performance too much. Therefore, as shown in Fig. 13.20, BMCook systematically implements model quantization, model distillation, model pruning, and model MoEfication. To take full advantage of these compression techniques, BMCook builds a unified compression framework that supports arbitrary combinations of them. Next, we will introduce typical model compression techniques and how to use them to compress big models with the help of BMCook. More details about BMCook can be found in the paper [41].

Fig. 13.20 The unified compression framework of BMCook [41]. (The figure is re-drawn according to Fig. 1 from the BMCook paper [41])

13.4.1 Model Quantization

Model quantization aims to represent the parameters of big models with lower-bit data types rather than the commonly used 32-bit floating-point type. By using lower-bit data types to represent model parameters, both the memory and computation costs can be significantly reduced. For example, computation with an 8-bit fixed-point type can be up to 4 times faster than with a 32-bit floating-point type.

The widely used model quantization methods mainly follow two paradigms: post-training quantization (PTQ) and quantization-aware training (QAT). PTQ [3] directly quantizes model parameters after model learning is completed. PTQ is simple, but it may bring significant performance degradation, since lower-bit data types bring efficiency improvements and accuracy degradation at the same time. QAT [32] is proposed to alleviate the degradation caused by quantization. Specifically, QAT simulates the quantization process during model learning so that model parameters are quantized under the guidance of training data. Taking linear layers as an example, QAT replaces all linear layers of a model with quantized linear layers, in which the matrix multiplication operation is replaced by quantized matrix multiplication.

Popular deep learning frameworks, such as PyTorch and TensorFlow, already support PTQ. Considering this, and aiming for better performance, BMCook mainly adopts QAT to compress models.

13.4.2 Model Distillation

Model distillation aims to transfer model knowledge from larger teacher models to smaller student models. Conventional distillation methods mainly focus on adding the KL divergence between the output results of teacher models and those of student models as an additional training objective [16], so that student models perform similarly to teacher models.

After the emergence of PTMs, making the inner computation of student models closer to that of teacher models has been shown to facilitate distilling PTMs [19, 28, 33]. Specifically, these distillation methods add an MSE loss between the hidden states of student and teacher models. Note that distillation methods only provide additional training objectives rather than directly reducing the parameter size. For this reason, any other compression method can be combined with distillation to further improve the performance of compressed models.

Both the KL-divergence-based and MSE-based distillation methods are implemented in BMCook.

13.4.3 Model Pruning

Model pruning is widely used to remove redundant model parameters. Existing model pruning methods mainly follow two paradigms: structured pruning and unstructured pruning. Structured pruning removes complete redundant modules such as model layers [10, 35, 37, 43], while unstructured pruning removes individual parameters [5, 14, 38]. Since the pruned parameters do not need to be stored in memory and are not involved in model computation, model pruning can reduce the requirements of both memory and computing power.

BMCook supports both structured and unstructured pruning. However, unstructured pruning cannot guarantee an efficiency gain, since existing parallel processing devices (e.g., GPUs) do not sufficiently support arbitrary sparse computation operations [44]. Considering that 2:4 sparsity is well supported by Sparse Tensor Cores [45], BMCook implements unstructured pruning with 2:4 sparsity, i.e., it forces every group of four consecutive parameters to contain two zeros. In this way, BMCook's sparse operations can be up to twice as fast as their dense counterparts.

13.4.4 Model MoEfication

Transformers that adopt ReLU [27] as the activation function of their feed-forward networks (FFNs) exhibit a sparse activation phenomenon, so only part of the FFNs needs to be used for a specific input without affecting the model performance. To this end, model MoEfication [42] is proposed to transform Transformers into mixture-of-experts (MoE) versions [11], which can significantly reduce the computation costs of Transformers. Model MoEfication only selects part of the model parameters for computation rather than changing or removing them. Therefore, model MoEfication can be viewed as a post-processing technique that can be applied to an already compressed model to further improve efficiency.

13.4.5 QuickStart of BMCook

Compressing a model with BMCook requires setting concrete compression strategies in a configuration file. The configuration file includes hyper-parameter settings, model configurations, and which compression methods to use. Figure 13.21 is an example of the configuration file. In the configuration file, we can specify whether each compression method is used and determine which modules of the model each compression method is applied to.

Fig. 13.21 An example of the configuration file of BMCook
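A sketch of such a configuration file; the field names follow those visible in Fig. 13.21, but the exact set of fields and their defaults may differ across BMCook versions, and the mask path is a placeholder.

```json
{
  "distillation": {
    "ce_scale": 0,
    "mse_hidn_scale": 1
  },
  "pruning": {
    "is_pruning": true,
    "pruning_mask_path": "results/prune_mask.bin"
  },
  "quantization": {
    "is_quant": true
  },
  "MoEfication": {
    "is_moefy": false
  }
}
```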

After setting up the configuration file, some code needs to be modified to drive BMCook to compress the model. Next, we will introduce how to use BMCook to run each compression method.

Model Quantization

As shown in Fig. 13.22, to perform QAT in BMCook, we only need to apply the function BMQuant.quantize to the given model. The function automatically replaces all linear modules in the model with QAT linear modules.

Fig. 13.22 An example of performing model quantization in BMCook
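A sketch of applying QAT with BMCook; the import path is an assumption based on the description above, and config is the object loaded from the configuration file.

```python
from bmcook import BMQuant

# config is loaded from the BMCook configuration file shown in Fig. 13.21.
# Replace all linear modules in the model with QAT linear modules.
BMQuant.quantize(model, config)
```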

Model Distillation

As shown in Fig. 13.23, to perform model distillation in BMCook, we need to use the function BMDistill.set_forward to combine the distillation loss function with the original loss function of the student model.

Fig. 13.23 An example of performing model distillation in BMCook
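A sketch of enabling distillation; forward_fn stands for the original forward function of the student model, and the argument order is an assumption based on the description above.

```python
from bmcook import BMDistill

# Wrap the student's forward function so that its loss additionally includes
# the distillation objectives computed against the teacher model.
forward_fn = BMDistill.set_forward(student_model, teacher_model, forward_fn, config)
```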

Model Pruning

In BMCook, model pruning is implemented based on a pruning mask mechanism, which requires modifying the optimizer to freeze the masked parameters. As shown in Fig. 13.24, model pruning in BMCook can be performed by using BMPrune.compute_mask and BMPrune.set_optim_for_pruning, which compute the pruning mask matrix and modify the optimizer, respectively.

Fig. 13.24 An example of performing model pruning in BMCook
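A sketch of enabling pruning, again under the assumption that config is the loaded configuration object:

```python
from bmcook import BMPrune

# Compute the pruning mask matrix according to the configuration file.
BMPrune.compute_mask(model, config)
# Modify the optimizer so that masked (pruned) parameters stay frozen.
BMPrune.set_optim_for_pruning(optimizer)
```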

Model MoEfication

Model MoEfication transforms a dense model into a sparsely activated one. It requires the hidden states produced during the forward propagation to learn how to select experts. As shown in Fig. 13.25, performing MoEfication requires using BMMoE.get_hidden.

Fig. 13.25 An example of performing model MoEfication in BMCook
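A sketch of collecting the hidden states for MoEfication; the third argument (the forward function whose hidden states are retained) is an assumption based on the description above.

```python
from bmcook import BMMoE

# Retain the FFN hidden states during the forward propagation; they are later
# used to split the FFNs into experts and to learn the expert routing.
BMMoE.get_hidden(model, config, forward_fn)
```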

13.5 BMInf: Efficient Inference Toolkit for Big Models

How to achieve efficient model inference has long been of great interest to the industry. ONNX, proposed in 2017, is a general model representation format compatible with various mainstream deep learning frameworks such as PyTorch and TensorFlow. Owing to the excellent compatibility of the ONNX format, many tools for accelerating model inference based on it have sprung up, the most well-known being TensorRT and onnxruntime. Up to now, transforming big models into the ONNX format and then using TensorRT or onnxruntime for inference has become a widely used paradigm in the industry. FasterTransformer, released by NVIDIA in 2019, is a toolkit for the efficient inference of Transformer-based models on CUDA devices, which is particularly relevant since recently proposed big models are mainly based on Transformers.

Although these toolkits for accelerating model inference have achieved promising results, they cannot meet the inference requirements of models with more than 10 billion parameters. The main bottleneck of big model inference is the computing power and memory capacity of GPUs. For a model with 10 billion parameters, at least 20 GB of memory is required to store the model, and a throughput of over 10^13 FLOPS is required to reach a usable inference speed. Such requirements are beyond the computing power and capacity of most GPUs. To make it possible to run big models on consumer GPUs, we need to leverage heterogeneous devices such as CPUs and their RAM. Moreover, we also need to apply quantization and memory scheduling techniques to reduce the computing power and memory requirements of big models. These techniques, which are not well supported by existing toolkits, are well integrated into BMInf. Next, we will introduce these inference techniques and how to use BMInf for model inference. More details about BMInf can be found in the paper [15].

13.5.1 Accelerating Big Model Inference

Optimizing the efficiency of model inference is usually closely related to specific hardware platforms. Since CUDA-based GPUs are now widely used in academia and industry, we focus on introducing inference acceleration techniques around the GPUs based on the CUDA architecture.

One of the most common optimizations is kernel fusion [1, 12, 34]. In the CUDA architecture, the basic operation unit is the kernel, and each kernel needs to read data from and write results back to global memory. Complex operators of neural networks usually require multiple kernels to cooperate, leading to redundant computation and memory access. For example, if batch normalization is implemented as a combination of multiplication and addition kernels, the intermediate variables need to be written back to global memory. By using kernel fusion to integrate the multiplication and addition kernels into one kernel, the intermediate variables can instead be kept in registers rather than global memory, which is much faster (Fig. 13.26).

Fig. 13.26 The example of batch normalization. As shown in this figure, kernel fusion saves time in accessing global memory and improves efficiency

Another common optimization is model quantization. In Sect. 13.4, we have described how to use BMCook for model quantization, and we thus will not introduce more quantization details here. For CUDA-based devices, using shorter data types (i.e., lower-bit data types) is usually much faster than using longer ones. Therefore, model quantization can significantly improve the efficiency of model inference. Using different quantization strategies may result in different running efficiencies for different models. For example, for a model based on Transformers, the main time overhead is in the computation of its linear layers, and quantizing the linear layers of this model will lead to efficiency improvements. Such a quantization strategy is implemented in BMInf.

13.5.2 Reducing the Memory Footprint of Big Models

Due to the huge parameter size, using only GPUs for big model inference requires each GPU to have a huge memory capacity. This requirement raises the threshold for applying big models to specific tasks. In order to run big models on consumer GPUs with limited memory capacity, it is critical to take advantage of CPUs to alleviate the requirement on GPU memory. Specifically, when performing big model inference, we can place the parameters that will not be used in the near future in CPU memory instead of keeping all parameters on GPUs all the time. These parameters stored in CPU memory are transferred to GPUs when they need to participate in the computation.

Note that such storage optimization comes at a price: the frequent passing of parameters between CPUs and GPUs requires non-negligible additional communication time. To alleviate this, along with using both GPUs and CPUs, CPU-GPU scheduling is applied to overlap the parameter transfer with the model computation. With CPU-GPU scheduling, the parameters that GPUs will use soon can be transferred from CPUs to GPUs beforehand.

Based on accelerating big model inference and reducing the memory footprint of big models, it is possible to run extremely big models on a single small-capacity GPU with limited computation overhead. BMInf implements kernel fusion, model quantization, and automatic CPU-GPU scheduling at the layer level. Table 13.1 shows the performance of BMInf running a big model with 10 billion parameters on different GPUs. From the table, we can see that combining all the above techniques enables us to run a big model even on a single GPU with only 6 GB capacity.

Table 13.1 The inference performance of the big model with 10 billion parameters on different GPUs using PyTorch with BMInf

13.5.3 QuickStart of BMInf

As shown in Fig. 13.27, BMInf is an out-of-the-box toolkit that requires only minor modifications to the original model inference code. After implementing the model to be used for inference in PyTorch, users only need to load the model parameters on the CPU and then use BMInf's wrapper function to wrap the model. The wrapped model automatically applies optimization techniques according to the specific hardware conditions during inference, without manual intervention. As shown in Table 13.1, with just a few lines of code, BMInf can help a big model achieve inference speeds on consumer GPUs far exceeding those of its plain PyTorch implementation.

Fig. 13.27 Using the wrapper function to wrap the model can drive BMInf to optimize model inference
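A sketch of the wrapping step; MyBigModel and the checkpoint path are hypothetical placeholders for the user's own PyTorch model and parameters.

```python
import torch
import bminf

# Build the model with plain PyTorch and load its parameters on the CPU.
model = MyBigModel()
model.load_state_dict(torch.load("checkpoint.pt", map_location="cpu"))

# Apply the BMInf wrapper; on the chosen CUDA device, BMInf then handles
# quantization, optimized kernels, and CPU-GPU parameter scheduling.
with torch.cuda.device(0):
    model = bminf.wrapper(model)
```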

13.6 Summary and Further Readings

In this chapter, we review the progress of big models in NLP and highlight the difficulties of running big models in practical applications, together with the corresponding solutions. Then, we introduce how to use the toolkits of OpenBMB to achieve efficient training, tuning, compression, and inference. To learn more about the latest progress of big models, please follow BMList, which tracks the latest releases of big models and is very helpful for understanding their trends. We also recommend Chap. 5 for more details on effective and efficient big models.