13.1 Introduction

From the emergence of pre-trained models (PTMs) represented by BERT [7] in 2018 to the subsequent release of GPT-3 [4] with 175 billion parameters in 2020, PTMs have attracted increasing attention from academia and industry. As introduced in Chap. 5, these PTMs perform well on various downstream tasks, and their performance can be further improved by increasing the parameter size. From BERT and GPT-3 to the recently proposed big models [11, 18, 39, 40], the parameter sizes of these models have gradually grown from hundreds of millions to trillions. By 2022, the record for the largest parameter size had been raised to over 1 trillion [11].

Although increasing model parameter size brings better model performance, enabling most institutions and individuals to enjoy the power of big models still faces several challenges:

Computation Barrier

Bigger models inevitably require higher computing costs in their training, tuning, and inference. For example, as shown in the original paper of GPT-3, more than 1,000 high-performance GPUs were used to train GPT-3. Most small- and medium-sized institutes cannot afford such a colossal computing cluster. Moreover, even after the training process is complete, deploying these big models for a specific application can still cost thousands of dollars at a time for model inference, indicating that using big models is expensive in addition to training them.

Expertise Barrier

Many deep learning techniques that work well on small models become inefficient or inapplicable on big models. For example, full-parameter fine-tuning has been widely used to tune PTMs for specific downstream tasks. If full-parameter fine-tuning is performed to adapt a big model for each downstream task, we have to spend terabytes of disk and memory space to store all the task-specific fine-tuned models. Moreover, as parameter size grows to billions, some novel characteristics specific to big models emerge, such as in-context learning [26], chain-of-thought reasoning [36], and the lottery ticket hypothesis [13]. Hence, novel techniques such as prompt learning [24], delta tuning [9], and MoEfication [42] have been developed specifically for big models. These techniques form an expertise barrier for researchers and practitioners who are interested in big models but lack extensive experience.

To address these barriers, we introduce OpenBMB, an open-source suite for building big model systems more efficiently, with the goal of making big models available to everyone. OpenBMB contains several toolkits for different application scenarios of big models, including BMTrain for efficient training (in Sect. 13.2), OpenPrompt and OpenDelta for efficient tuning (in Sect. 13.3), BMCook for efficient compression (in Sect. 13.4), and BMInf for efficient inference (in Sect. 13.5). Figure 13.1 shows how these toolkits work together to form an effective and efficient big model system. In this chapter, we introduce these toolkits in detail, covering the key technologies behind them and how to run them in code. More details of advanced big model techniques, such as prompt learning and delta tuning, can be found in Chap. 5.

Fig. 13.1 An overview of the whole OpenBMB architecture. (The figure comes from the website of OpenBMB https://www.openbmb.cn)

13.2 BMTrain: Efficient Training Toolkit for Big Models

The success of big models is due to large-scale data and huge parameter sizes. The large-scale data enables big models to acquire versatile knowledge during the pre-training stage, while the huge parameter size enables big models to store the acquired knowledge well. However, utilizing large-scale data and huge parameter sizes is a double-edged sword, simultaneously bringing significant performance enhancement and a severe computation barrier. The key to overcoming the computation barrier of big model training is to adopt more efficient distributed learning strategies that take full advantage of distributed computing clusters. Next, we will introduce the distributed learning strategies used in BMTrain and how to use BMTrain to set these strategies to train big models efficiently.

13.2.1 Data Parallelism

Intuitively, training a big model on a GPU cluster is similar to organizing an event with many partners. Imagine that you have a lot of tasks to do and you want to get them done as quickly as possible; a good solution is to distribute these tasks equally to your partners so that everyone is busy and no one is idle. This is the principle of data parallelism. As shown in Fig. 13.2, the training data is divided into several parts, and each GPU is responsible for training a part of the data. For the model parameters, each GPU stores the whole model. In each iteration, each GPU calculates the gradient for its own part. After all the gradients are computed, the gradients of different GPUs are averaged together as the overall gradient. The overall gradient is then passed to the optimizer to update the model parameters. Finally, the updated parameters are sent back to each GPU for the next iteration. This process is repeated until the training is finished.

Fig. 13.2 Data parallelism partitions the input data evenly to each GPU before the forward propagation and aggregates the gradients from all GPUs to update the model parameters after the backward propagation

Data parallelism is largely similar to training a model on a single GPU, except that data parallelism adds two additional stages of gradient aggregation and parameter synchronization. Since the training data is evenly partitioned to each GPU, each GPU performs the same amount of computations. As a result, all GPUs can work concurrently to achieve the highest utilization, rather than having some nodes idle while others are under high load.
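To make this workflow concrete, the following is a minimal, generic PyTorch sketch of data parallelism using DistributedDataParallel. It is not BMTrain-specific, and the model class MyModel and the dataset object are hypothetical placeholders for the user's own model and data.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# One process per GPU; the process group handles gradient communication.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = DDP(MyModel().cuda(), device_ids=[local_rank])  # full copy of parameters on every GPU
sampler = DistributedSampler(dataset)                   # each GPU reads a disjoint shard of the data
loader = DataLoader(dataset, sampler=sampler, batch_size=32)
optimizer = torch.optim.Adam(model.parameters())

for x, y in loader:
    loss = torch.nn.functional.cross_entropy(model(x.cuda()), y.cuda())
    loss.backward()        # gradients are averaged across all GPUs during backward
    optimizer.step()       # every GPU applies the same averaged gradient
    optimizer.zero_grad()
```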

Data parallelism can effectively utilize large-scale computing clusters, but its limitations are gradually exposed as the model size increases. Since data parallelism requires keeping the complete model parameters on each GPU, this method becomes infeasible when the model parameters no longer fit in the memory of a single GPU. For example, for a model with 10 billion parameters, more than 40 GB of memory is needed to store the model parameters and gradients, which is far beyond the capacity of most GPUs. Besides, GPUs with a capacity of 40 GB are expensive and cost-inefficient. This parallel strategy needs to be further improved to enable big models to be trained with limited resources.

13.2.2 ZeRO Optimization

ZeRO (zero redundancy optimizer) [29] is a strategy that allows efficient training of models whose parameter size far exceeds the capacity of a single GPU. As mentioned in the data parallelism section, the complete model parameters, gradients, and optimizer states of a big model cannot be stored on a single GPU. As shown in Fig. 13.3, the ZeRO algorithm further optimizes data parallelism by partitioning the model parameters, gradients, and optimizer states evenly across GPUs and only temporarily aggregating them when needed. The model parameters are aggregated twice, once during the forward propagation and once during the backward propagation, and the gradients are aggregated once when updating the optimizer states. Therefore, ZeRO needs to communicate three times in each iteration, one more time than data parallelism.

Fig. 13.3 ZeRO partitions the model parameters, gradients, and optimizer states evenly to each GPU and only gathers them together temporarily when they are needed for the computation

Although it brings additional communication, the ZeRO algorithm can train big models on a GPU cluster with limited capacity. For example, with ZeRO, it is possible to train a model with 175 billion parameters on 64 A100 GPUs, while data parallelism on the same cluster can only train a model with no more than 3 billion parameters. Empirically, the number of model parameters that can be trained with the ZeRO algorithm grows with the number of GPUs, since each GPU only needs to hold a fraction of the model states.

13.2.3 Quickstart of BMTrain

BMTrain is an open-source toolkit for big model training. It provides easy-to-use interfaces based on PyTorch to help users accelerate the training process using the data parallelism and ZeRO algorithms. For commonly used model architectures such as Transformers, BMTrain implements a series of customized optimizations and makes writing distributed training code just like writing single-node training code. With BMTrain, only four steps are required to modify a single-node training code to drive a distributed cluster for training acceleration.

Step 1: Initializing BMTrain

Since BMTrain is built on PyTorch, just as PyTorch calls the function init_process_group at the beginning of the code to initialize distributed training, BMTrain requires calling init_distributed at the beginning of the code to initialize BMTrain (Fig. 13.4). The function init_distributed also allows users to set random seeds. By setting random seeds, BMTrain provides a reliable and reproducible mechanism to control all random processes, ensuring that multiple runs of the same code yield stable results.

Fig. 13.4 An example of initializing BMTrain
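The initialization in Fig. 13.4 roughly corresponds to the following sketch; the zero_level argument appears in the figure, but its exact name and default may vary across BMTrain versions.

```python
import bmtrain as bmt

# Initialize BMTrain at the very beginning of the training script,
# analogous to torch.distributed.init_process_group in plain PyTorch.
bmt.init_distributed(
    seed=0,        # fix all random seeds for reproducible results
    zero_level=3,  # ZeRO optimization level (as shown in Fig. 13.4)
)
```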

Step 2: Enabling ZeRO Optimization

After initializing BMTrain, the data parallel and ZeRO algorithms can be applied by making some simple modifications to the single-node training code. As shown in Fig. 13.5, users should modify the model construction code to replace Module and Parameter in PyTorch with DistributedModule and DistributedParameter implemented in BMTrain. To further alleviate memory overhead, wrapping some model layers with CheckpointBlock enables the activation checkpointing mechanism, details of which can be found in the work [6].

Fig. 13.5 An example of enabling the data parallel and ZeRO algorithms. (a) An example of the single-node training code built on PyTorch. (b) An example of modifying the single-node training code to the BMTrain version
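A sketch of the BMTrain version in Fig. 13.5b might look as follows; SomeTransformerBlock is a hypothetical placeholder for an ordinary PyTorch layer.

```python
import torch
import bmtrain as bmt

class MyModel(bmt.DistributedModule):  # replaces torch.nn.Module
    def __init__(self):
        super().__init__()
        # Replaces torch.nn.Parameter; the parameter is partitioned across GPUs.
        self.param = bmt.DistributedParameter(torch.empty(1024))
        # Wrapping layers with CheckpointBlock enables activation checkpointing.
        self.module_list = torch.nn.ModuleList([
            bmt.CheckpointBlock(SomeTransformerBlock()),
            bmt.CheckpointBlock(SomeTransformerBlock()),
        ])
```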

Step 3: Communication Optimization (Optional)

For those model architectures with multiple layers, such as Transformers, BMTrain provides a communication optimization module to reduce the time overhead by overlapping communication and computation. As shown in Fig. 13.6, to enable this optimization, users need to replace ModuleList in PyTorch with TransformerBlockList implemented in BMTrain.

Fig. 13.6 An example of enabling the communication optimization algorithm. (a) An example of using ModuleList to define a multi-layer model in PyTorch. (b) An example of using TransformerBlockList to replace ModuleList
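A sketch of the replacement in Fig. 13.6b, under the same assumptions as the previous example (SomeTransformerBlock is a placeholder):

```python
import torch
import bmtrain as bmt

class MyModel(bmt.DistributedModule):
    def __init__(self):
        super().__init__()
        self.param = bmt.DistributedParameter(torch.empty(1024))
        # TransformerBlockList replaces torch.nn.ModuleList and lets BMTrain
        # overlap the communication of one layer with the computation of another.
        self.module_list = bmt.TransformerBlockList([
            bmt.CheckpointBlock(SomeTransformerBlock()),
            bmt.CheckpointBlock(SomeTransformerBlock()),
        ])

    def forward(self, x):
        # The block list is called as a whole instead of iterating layer by layer.
        return self.module_list(x)
```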

Step 4: Launching the Distributed Training

The code optimized by BMTrain can be run using the same launcher as the PyTorch distributed module. Figure 13.7 shows the distributed training command, which depends on the specific PyTorch version, and this command should be executed on all nodes in the cluster at the same time. In the launching command, users should specify the communication configuration, including the IP address and port number of the master node.

Fig. 13.7 An example of launching the training code train.py in BMTrain
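The launching commands in Fig. 13.7 roughly take the following form; the angle-bracket values are placeholders that depend on the cluster, and the exact flags depend on the PyTorch version.

```bash
# For older PyTorch versions (< 1.10)
python3 -m torch.distributed.launch --nnodes <num_nodes> --node_rank <rank> \
    --nproc_per_node <gpus_per_node> \
    --master_addr <master_ip> --master_port <port> train.py

# For newer PyTorch versions (>= 1.10)
torchrun --nnodes <num_nodes> --nproc_per_node <gpus_per_node> \
    --rdzv_id=1 --rdzv_backend=c10d \
    --rdzv_endpoint=<master_ip>:<port> train.py
```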

13.3 OpenPrompt and OpenDelta: Efficient Tuning Toolkit for Big Models

In Chap. 5, we have introduced the general capabilities of big models and their excellent performance on a wide range of downstream tasks. In this section, we focus on showing the critical role that prompt learning and delta tuning play in applications and how to adapt big models to downstream tasks using OpenPrompt and OpenDelta. More details of prompt learning and delta tuning can be found in Chap. 5.

13.3.1 Serving Multiple Tasks with a Unified Big Model

Tuning is an essential part of the practical application of big models, as it adapts the generic capabilities of big models to specific tasks. Due to big models' high capability and generality, a single big model can be adapted to dozens or hundreds of different downstream tasks. Before prompt learning and delta tuning, full-parameter fine-tuning was the mainstream method to adapt big models, where a complete model was fine-tuned and stored for each downstream task. As the parameter size of models increases, the storage overhead of full-parameter fine-tuning becomes increasingly heavy. Moreover, as the number of downstream tasks continues to grow, so does the memory required to deploy these models. These issues seriously increase the cost of using and deploying big models.

Compared with full-parameter fine-tuning, prompt learning and delta tuning are more friendly to big models, because these two tuning approaches can adapt big models to different downstream tasks by changing only a small number of parameters, or even without changing model parameters at all, as shown in Fig. 13.8. This feature solves the storage problem caused by full-parameter fine-tuning and further reduces the heavy resource consumption of deploying multiple models for different tasks. In addition, both prompt learning and delta tuning usually design task instructions in the form of natural language to stimulate the knowledge of big models to solve specific problems. Compared with using a programming language to control big models, using task instructions is more human-friendly.

Fig. 13.8 Big models are fixed on the cluster. Different delta objects are dynamically loaded for different tasks with almost no overhead per task

Based on prompt learning and delta tuning, we can deploy a big model on a GPU cluster and dynamically load task-specific task instructions or delta objects rather than task-specific models. These task instructions and delta objects cooperate with the big model to handle multiple downstream tasks. In this way, the resources consumed during deployment do not grow with the number of downstream tasks. Aggregating all tasks onto the same cluster also brings higher utilization, avoiding the situation where part of the cluster is idle due to unbalanced task requests.

13.3.2 Quickstart of OpenPrompt

OpenPrompt is an open-source prompt learning framework with high extensibility. OpenPrompt supports a variety of mainstream prompt learning methods [21, 30] and can help users more easily apply prompt learning on existing models or develop new models. As shown in Fig. 13.9, a PromptModel consists of a PTM as the backbone, one or more Templates to wrap the raw input with task instructions, and one or more Verbalizers to map the task labels to the vocabulary of PTMs. Developing a prompt learning pipeline using OpenPrompt requires the following steps to build a PromptModel.

Fig. 13.9 The framework of OpenPrompt. (The figure is re-drawn according to Fig. 1 from the OpenPrompt paper [8])

Step 1: Defining the Task

The first step in using OpenPrompt is to define the details of the task. Taking sentiment classification as an example, which aims to judge whether an input sentence is positive or negative, Fig. 13.10 shows how to define the dataset used for prompt learning on this task.

Fig. 13.10 An example of defining the task for prompt learning in OpenPrompt
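A sketch of such a task definition, following the classes and examples suggested by Fig. 13.10 (the two example sentences are illustrative placeholders):

```python
from openprompt.data_utils import InputExample

# There are two classes in sentiment classification: negative and positive.
classes = ["negative", "positive"]

# For simplicity, only two unlabeled examples are defined; text_a holds the input sentence.
dataset = [
    InputExample(guid=0, text_a="Albert Einstein was one of the greatest intellects of his time."),
    InputExample(guid=1, text_a="The film was badly made."),
]
```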

Step 2: Loading the PTM

A PTM is the backbone of a PromptModel. OpenPrompt supports directly loading models from online model hubs, such as ModelCenter built on BMTrain and Hugging Face Transformers built on PyTorch (Fig. 13.11).

Fig. 13.11 An example of loading the PTM for prompt learning in OpenPrompt
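A sketch of loading a PTM with OpenPrompt's load_plm helper, here using BERT as in Fig. 13.11:

```python
from openprompt.plms import load_plm

# Here we use BERT as an example; load_plm returns the model, its tokenizer,
# its configuration, and the tokenizer wrapper class used by PromptDataLoader.
plm, tokenizer, model_config, WrapperClass = load_plm("bert", "bert-base-cased")
```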

Step 3: Defining the Task Instruction

At least one task instruction is required to build a PromptModel. The Template is used to modify the original input with the defined task instruction. Figure 13.12 shows how to join the task instruction "It was" with the field text_a of the dataset defined in Step 1.

Fig. 13.12 An example of defining the task instruction for prompt learning in OpenPrompt
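A sketch of such a template, using OpenPrompt's manual template syntax:

```python
from openprompt.prompts import ManualTemplate

# Wrap the raw input with the instruction "It was"; {"mask"} marks the position
# whose predicted word will be mapped to a label by the verbalizer.
promptTemplate = ManualTemplate(
    text='{"placeholder":"text_a"} It was {"mask"}',
    tokenizer=tokenizer,
)
```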

Step 4: Defining the Verbalizer

At least one verbalizer is required for a PromptModel. The verbalizer is an important part of prompt learning that maps the output words to the task labels. Figure 13.13 maps the word "bad" to the negative label and the words "good", "wonderful", and "great" to the positive label.

Fig. 13.13 An example of defining the verbalizer for prompt learning in OpenPrompt
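A sketch of such a verbalizer with OpenPrompt's ManualVerbalizer:

```python
from openprompt.prompts import ManualVerbalizer

# Map "bad" to the negative label, and "good", "wonderful", "great" to the positive label.
promptVerbalizer = ManualVerbalizer(
    classes=classes,
    label_words={
        "negative": ["bad"],
        "positive": ["good", "wonderful", "great"],
    },
    tokenizer=tokenizer,
)
```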

Step 5: Combining Different Modules into a PromptModel

Based on the PTM, task instruction, and verbalizer obtained earlier, we can define a PromptModel using the code in Fig. 13.14.

Fig. 13.14 An example of building the PromptModel for prompt learning in OpenPrompt
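A sketch of combining the three components into a classification PromptModel:

```python
from openprompt import PromptForClassification

# Combine the backbone PTM, the template, and the verbalizer into one model.
promptModel = PromptForClassification(
    template=promptTemplate,
    plm=plm,
    verbalizer=promptVerbalizer,
)
```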

Step 6: Defining the PromptDataLoader

In order to learn the defined prompt, we also need a dataloader to sample data from the dataset. The PromptDataLoader in Fig. 13.15 is an extension of the PyTorch dataloader used to sample data for prompt learning.

Fig. 13.15 The example of building the dataloader for prompt learning in OpenPrompt
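A sketch of building the dataloader, reusing the objects defined in the previous steps:

```python
from openprompt import PromptDataLoader

# Wrap each raw example with the template and tokenize it for the backbone PTM.
data_loader = PromptDataLoader(
    dataset=dataset,
    tokenizer=tokenizer,
    template=promptTemplate,
    tokenizer_wrapper_class=WrapperClass,
)
```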

Step 7: Training and Inference

After defining the PromptModel and PromptDataLoader, we can train the prompt and perform inference with it. All the training and inference code can be implemented with PyTorch. Figure 13.16 shows an inference example based on OpenPrompt.

Fig. 13.16 An example of model inference based on prompt learning in OpenPrompt
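A sketch of zero-shot inference with the PromptModel defined above:

```python
import torch

# Predict a class index (0 = negative, 1 = positive) for each example.
promptModel.eval()
with torch.no_grad():
    for batch in data_loader:
        logits = promptModel(batch)
        preds = torch.argmax(logits, dim=-1)
        print(classes[preds])
```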

13.3.3 QuickStart of OpenDelta

OpenDelta is an open-source delta tuning toolkit that can perform delta tuning without modifying the code of the backbone PTM. By using OpenDelta, users can easily implement prefix tuning [23], adapter tuning [20], LoRA [17], or any other type of delta tuning. OpenDelta also supports sharing delta objects, so that users can load delta objects trained by others or save and publish their own. Adapting PTMs using OpenDelta takes only the following three steps.

Step 1: Loading the PTM

Similar to OpenPrompt, OpenDelta requires a PTM to be loaded first as the backbone for subsequent tuning. The code in Fig. 13.17 shows how to load BART [22].

Fig. 13.17 The example of loading the PTM for delta tuning in OpenDelta
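A sketch of loading BART from Hugging Face Transformers as the backbone (the checkpoint name is an illustrative choice):

```python
from transformers import BartForConditionalGeneration

# Load BART as the backbone model.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
```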

Step 2: Adding the Delta Object

After loading the PTM, OpenDelta requires users to specify the parameters to be tuned (i.e., the delta object). The code in Fig. 13.18 shows how to use the built-in modifications (i.e., adapter layers) in OpenDelta to specify the parameters that need to be tuned.

Fig. 13.18 The example of specifying the delta object for delta tuning in OpenDelta
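A sketch of attaching adapter layers with OpenDelta; the modified_modules argument names the sub-modules to attach adapters to and is given here only as an illustrative assumption.

```python
from opendelta import AdapterModel

# This will apply adapter tuning: adapter layers are inserted after the
# specified sub-modules of the backbone.
delta_model = AdapterModel(backbone_model=model, modified_modules=["fc2"])
```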

Step 3: Freezing the Backbone

After adding the delta object, the parameters of the backbone model are not automatically frozen. Users need to use the freeze_module method provided by OpenDelta to freeze the backbone model (Fig. 13.19).

Fig. 13.19 The example of freezing the backbone for delta tuning in OpenDelta
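A sketch of freezing the backbone so that only the delta object remains trainable:

```python
# Freeze the parameters of the backbone; only the delta (adapter) parameters stay trainable.
delta_model.freeze_module(exclude=["deltas"], set_state_dict=True)
delta_model.log()  # optionally inspect which parameters are trainable
```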

After completing the above three steps, the model can be trained with an ordinary training script. After training is completed, users can use the state_dict interface to obtain the delta object's parameters, just as they would obtain parameters in PyTorch. The delta object obtained with OpenDelta usually requires very little storage space, so it can easily be saved and shared with others.

13.4 BMCook: Efficient Compression Toolkit for Big Models

The research on model compression started long ago, and in recent years several vital directions, such as quantization [2, 3, 31, 32], distillation [16, 19, 28, 33], and pruning [5, 10, 14, 35, 37, 38, 43], have been widely explored. Before the emergence of big models, model compression techniques were mainly applied to adapt models to various low-resource end devices, such as cell phones and cameras, or to real-time applications that require low-latency inference. After big models became popular, their inference requires more expensive high-end devices than conventional deep models do, making compression even more critical.

In order to make big models run on common devices such as consumer GPUs, we have to combine different compression techniques to minimize the parameter size of these models without degrading the model performance too much. Therefore, as shown in Fig. 13.20, BMCook systematically implements model quantization, model distillation, model pruning, and model MoEfication. To take full advantage of these compression techniques, BMCook builds a unified compression framework that supports arbitrary combinations of them. Next, we will introduce typical model compression techniques and how to use them to compress big models with the help of BMCook. More details about BMCook can be found in the paper [41].

Fig. 13.20 The unified compression framework of BMCook [41]. (The figure is re-drawn according to Fig. 1 from the BMCook paper [41])

13.4.1 Model Quantization

Model quantization aims to represent the parameters of big models with lower-bit data types rather than the commonly used 32-bit floating-point type. By using lower-bit data types to represent model parameters, both the memory and computation costs can be significantly reduced. For example, computation with an 8-bit fixed-point type can be up to 4 times faster than with a 32-bit floating-point type.

The widely used model quantization methods mainly follow two paradigms: post-training quantization (PTQ) and quantization-aware training (QAT). PTQ [3] directly quantizes model parameters after model learning is completed. PTQ is simple, but it may bring significant performance degradation, since lower-bit data types bring efficiency improvements and accuracy degradation at the same time. QAT [32] is proposed to alleviate the degradation caused by quantization. Specifically, QAT simulates the quantization process during model learning so that model parameters are quantized under the guidance of training data. Taking linear layers as an example, QAT replaces all linear layers of a model with quantized linear layers, in which the matrix multiplication operation is replaced by quantized matrix multiplication.

Popular deep learning frameworks, such as PyTorch and TensorFlow, already support PTQ. Considering this, and aiming for better performance, BMCook mainly adopts QAT to compress models.

13.4.2 Model Distillation

Model distillation aims to transfer model knowledge from larger teacher models to smaller student models. Conventional distillation methods mainly focus on adding the KL divergence between the output results of teacher models and those of student models as an additional training objective [16], so that student models perform similarly to teacher models.

After the emergence of PTMs, making the inner computation of student models closer to that of teacher models has been shown to facilitate distilling PTMs [19, 28, 33]. Specifically, these distillation methods add an MSE loss between the hidden states of student and teacher models. Note that distillation methods only provide additional training objectives rather than directly reducing the parameter size. For this reason, any other compression method can be combined with distillation to further improve the performance of compressed models.

Both the KL-divergence-based and MSE-based distillation methods are implemented in BMCook.

13.4.3 Model Pruning

Model pruning is widely used to remove redundant model parameters. Existing model pruning methods mainly follow two paradigms: structured pruning and unstructured pruning. Structured pruning removes complete redundant modules such as model layers [10, 35, 37, 43], while unstructured pruning removes individual parameters [5, 14, 38]. Since the pruned parameters do not need to be stored in memory and are not involved in model computation, model pruning can reduce the requirements of both memory and computing power.

BMCook supports both structured and unstructured pruning. However, unstructured pruning cannot guarantee an efficiency gain, since existing parallel processing devices (e.g., GPUs) do not sufficiently support arbitrary sparse computation operations [44]. Considering that 2:4 sparsity is well supported by Sparse Tensor Cores [45], BMCook implements unstructured pruning with 2:4 sparsity, i.e., it forces every group of four consecutive parameters to contain two zeros. In this way, BMCook's sparse operations can be up to twice as fast as their dense counterparts.

13.4.4 Model MoEfication

Transformers that adopt ReLU [27] as the activation function of their feed-forward networks (FFNs) exhibit a sparse activation phenomenon, so only part of the FFNs needs to be used for a specific input without affecting the model performance. To this end, model MoEfication [42] is proposed to transform Transformers into mixture-of-experts (MoE) versions [11], which can significantly reduce the computation costs of Transformers. Model MoEfication only selects part of the model parameters for computation rather than changing or removing them. Therefore, model MoEfication can be viewed as a post-processing technique that can be applied to an already compressed model to further improve efficiency.

13.4.5 QuickStart of BMCook

Compressing a model with BMCook requires setting concrete compression strategies in a configuration file. The configuration file includes hyper-parameter settings, model configurations, and which compression methods to use. Figure 13.21 is an example of the configuration file. In the configuration file, we can specify whether each compression method is used and determine which modules of the model each compression method is applied to.

Fig. 13.21 An example of the configuration file of BMCook
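A sketch of such a configuration file; the field names follow those visible in Fig. 13.21, but the exact set of fields and their defaults may differ across BMCook versions, and the mask path is a placeholder.

```json
{
  "distillation": {
    "ce_scale": 0,
    "mse_hidn_scale": 1
  },
  "pruning": {
    "is_pruning": true,
    "pruning_mask_path": "results/prune_mask.bin"
  },
  "quantization": {
    "is_quant": true
  },
  "MoEfication": {
    "is_moefy": false
  }
}
```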

After setting up the configuration file, some code needs to be modified to drive BMCook to compress the model. Next, we will introduce how to use BMCook to run each compression method.

Model Quantization

As shown in Fig. 13.22, to perform QAT in BMCook, we only need to apply the function BMQuant.quantize to the given model. The function automatically replaces all linear modules in the model with QAT linear modules.

Fig. 13.22 An example of performing model quantization in BMCook
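A sketch of applying QAT with BMCook; the import path is an assumption based on the description above, and config is the object loaded from the configuration file.

```python
from bmcook import BMQuant

# config is loaded from the BMCook configuration file shown in Fig. 13.21.
# Replace all linear modules in the model with QAT linear modules.
BMQuant.quantize(model, config)
```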

Model Distillation

As shown in Fig. 13.23, to perform model distillation in BMCook, we need to use the function BMDistill.set_forward to combine the distillation loss function with the original loss function of the student model.

Fig. 13.23 An example of performing model distillation in BMCook
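A sketch of enabling distillation; forward_fn stands for the original forward function of the student model, and the argument order is an assumption based on the description above.

```python
from bmcook import BMDistill

# Wrap the student's forward function so that its loss additionally includes
# the distillation objectives computed against the teacher model.
forward_fn = BMDistill.set_forward(student_model, teacher_model, forward_fn, config)
```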

Model Pruning

In BMCook, model pruning is implemented based on a pruning mask mechanism, which requires modifying the optimizer to freeze the masked parameters. As shown in Fig. 13.24, model pruning in BMCook can be performed by using BMPrune.compute_mask and BMPrune.set_optim_for_pruning, which compute the pruning mask matrix and modify the optimizer, respectively.

Fig. 13.24 An example of performing model pruning in BMCook
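A sketch of enabling pruning, again under the assumption that config is the loaded configuration object:

```python
from bmcook import BMPrune

# Compute the pruning mask matrix according to the configuration file.
BMPrune.compute_mask(model, config)
# Modify the optimizer so that masked (pruned) parameters stay frozen.
BMPrune.set_optim_for_pruning(optimizer)
```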

Model MoEfication

Model MoEfication transforms a dense model into a sparsely activated one. It requires the hidden states produced during the forward propagation to learn how to select experts. As shown in Fig. 13.25, performing MoEfication requires using BMMoE.get_hidden.

Fig. 13.25 An example of performing model MoEfication in BMCook
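A sketch of collecting the hidden states for MoEfication; the third argument (the forward function whose hidden states are retained) is an assumption based on the description above.

```python
from bmcook import BMMoE

# Retain the FFN hidden states during the forward propagation; they are later
# used to split the FFNs into experts and to learn the expert routing.
BMMoE.get_hidden(model, config, forward_fn)
```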

13.5 BMInf: Efficient Inference Toolkit for Big Models

How to achieve efficient model inference has long been of great interest to the industry. ONNX, proposed in 2017, is a general model representation format compatible with various mainstream deep learning frameworks such as PyTorch and TensorFlow. Owing to the excellent compatibility of the ONNX format, many tools for accelerating model inference based on it have sprung up, the most well-known being TensorRT and onnxruntime. Up to now, transforming big models into the ONNX format and then using TensorRT or onnxruntime for inference has become a widely used paradigm in the industry. FasterTransformer, released by NVIDIA in 2019, is a toolkit for the efficient inference of Transformer-based models on CUDA devices, which is particularly relevant since recently proposed big models are mainly based on Transformers.

Although these toolkits for accelerating model inference have achieved promising results, they cannot meet the inference requirements of models with more than 10 billion parameters. The main bottleneck of big model inference is the computing power and memory capacity of GPUs. For a model with 10 billion parameters, at least 20 GB of memory is required to store the model, and a throughput of over 10^13 FLOPS is required to reach a usable inference speed. Such requirements are beyond the computing power and capacity of most GPUs. To make it possible to run big models on consumer GPUs, we need to leverage heterogeneous devices such as CPUs and their RAM. Moreover, we also need to apply quantization and memory scheduling techniques to reduce the computing power and memory requirements of big models. These techniques, which are not well supported by existing toolkits, are well integrated into BMInf. Next, we will introduce these inference techniques and how to use BMInf for model inference. More details about BMInf can be found in the paper [15].

13.5.1 Accelerating Big Model Inference

Optimizing the efficiency of model inference is usually closely related to specific hardware platforms. Since CUDA-based GPUs are now widely used in academia and industry, we focus on introducing inference acceleration techniques around the GPUs based on the CUDA architecture.

One of the most common optimizations is kernel fusion [1, 12, 34]. In the CUDA architecture, the basic operation unit is the kernel, and each kernel needs to read data from and write results back to global memory. Complex operators of neural networks usually require multiple kernels to cooperate, leading to redundant computation and memory access. For example, if batch normalization is implemented as a combination of multiplication and addition kernels, the intermediate variables need to be written back to global memory. By using kernel fusion to integrate the multiplication and addition kernels into one kernel, the intermediate variables can instead be kept in registers rather than global memory, which is much faster (Fig. 13.26).

Fig. 13.26 The example of batch normalization. As shown in this figure, kernel fusion saves time in accessing global memory and improves efficiency

Another common optimization is model quantization. In Sect. 13.4, we have described how to use BMCook for model quantization, and we thus will not introduce more quantization details here. For CUDA-based devices, using shorter data types (i.e., lower-bit data types) is usually much faster than using longer ones. Therefore, model quantization can significantly improve the efficiency of model inference. Using different quantization strategies may result in different running efficiencies for different models. For example, for a model based on Transformers, the main time overhead is in the computation of its linear layers, and quantizing the linear layers of this model will lead to efficiency improvements. Such a quantization strategy is implemented in BMInf.

13.5.2 Reducing the Memory Footprint of Big Models

Due to the huge parameter size, using only GPUs for big model inference requires each GPU to have a huge memory capacity. This requirement raises the threshold for applying big models to specific tasks. In order to run big models on consumer GPUs with limited memory capacity, it is critical to take advantage of CPUs to alleviate the requirement on GPU memory. Specifically, when performing big model inference, we can place the parameters that will not be used in the near future in CPU memory instead of keeping all parameters on GPUs all the time. These parameters stored in CPU memory are transferred to GPUs when they need to participate in the computation.

Note that such storage optimization comes at a price: the frequent passing of parameters between CPUs and GPUs requires non-negligible additional communication time. To alleviate this, along with using both GPUs and CPUs, CPU-GPU scheduling is applied to overlap the parameter transfer with the model computation. With CPU-GPU scheduling, the parameters that GPUs will use soon can be transferred from CPUs to GPUs beforehand.

Based on accelerating big model inference and reducing the memory footprint of big models, it is possible to run extremely big models on a single small-capacity GPU with limited computation overhead. BMInf implements kernel fusion, model quantization, and automatic CPU-GPU scheduling at the layer level. Table 13.1 shows the performance of BMInf running a big model with 10 billion parameters on different GPUs. From the table, we can see that combining all the above techniques enables us to run a big model even on a single GPU with only 6 GB capacity.

Table 13.1 The inference performance of the big model with 10 billion parameters on different GPUs using PyTorch with BMInf

13.5.3 QuickStart of BMInf

As shown in Fig. 13.27, BMInf is an out-of-the-box toolkit that requires only minor modifications to the original model inference code. After implementing the model to be used for inference in PyTorch, users only need to load the model parameters on the CPU and then use BMInf's wrapper function to wrap the model. The wrapped model automatically applies optimization techniques according to the specific hardware conditions during inference, without manual intervention. As shown in Table 13.1, with just a few lines of code, BMInf can help a big model achieve inference speeds on consumer GPUs far exceeding those of its plain PyTorch implementation.

Fig. 13.27 Using the wrapper function to wrap the model can drive BMInf to optimize model inference
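A sketch of the wrapping step; MyBigModel and the checkpoint path are hypothetical placeholders for the user's own PyTorch model and parameters.

```python
import torch
import bminf

# Build the model with plain PyTorch and load its parameters on the CPU.
model = MyBigModel()
model.load_state_dict(torch.load("checkpoint.pt", map_location="cpu"))

# Apply the BMInf wrapper; on the chosen CUDA device, BMInf then handles
# quantization, optimized kernels, and CPU-GPU parameter scheduling.
with torch.cuda.device(0):
    model = bminf.wrapper(model)
```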

13.6 Summary and Further Readings

In this chapter, we review the progress of big models in NLP and highlight the difficulties of running big models in practical applications, together with the corresponding solutions. Then, we introduce how to use the toolkits of OpenBMB to achieve efficient training, tuning, compression, and inference. To learn more about the latest progress of big models, please follow BMList, which tracks the latest releases of big models and is very helpful for understanding their trends. We also recommend Chap. 5 for more details on effective and efficient big models.