#MLefficiency — Optimizing transformer models for efficiency

As models grow in the number of parameters, their computational requirements grow with them. During inference, however, you want to allocate as few compute resources as possible for a service that runs 24/7, simply to keep operational costs low. Reducing the compute required for inference while preserving the original model performance is therefore an active field of research.

This post presents one example of how to reduce the model size. It is based on the paper QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, published by Intel in 2022 and accepted at NeurIPS 2022. The paper focuses on the computational cost of transformer-based language models at inference time.

The QuaLA-MiniLM model achieves a speedup of up to 8.8 times compared to the original model. Transforming the original model into the optimized version requires three steps:

  • MiniLM Distillation
  • Length Adaptive Transformer
  • Quantization

Each step is presented in more detail below.

MiniLM Distillation

MiniLM is based on a teacher-student setup. It uses deep self-attention distillation to produce a smaller version of the language model (LM) with only a small loss in performance.

In the first step, the student learns to mimic the self-attention module of the last layer of the teacher model, which contains rich linguistic information. This is achieved by minimizing the KL divergence between the self-attention distributions of the student and the teacher. This is called Attention Transfer (see the image below).

On the left side is the teacher, the original BERT model, and on the right side the student. The student only focuses on reproducing the information of the teacher's last layer.

The second step is called Value-Relation Transfer. It uses the values of the self-attention modules of the teacher and the student as additional guidance during training. The value relation of a model is the scaled dot-product between its value vectors; the loss is the KL divergence between the value relations of the student and the teacher.
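To make the two losses concrete, here is a minimal PyTorch sketch of Attention Transfer and Value-Relation Transfer as described above. It is an illustration under simplifying assumptions, not the authors' implementation: the tensor names and shapes are hypothetical, and teacher and student are assumed to use the same number of attention heads and the same sequence length.

```python
import torch
import torch.nn.functional as F

def attention_transfer_loss(attn_teacher, attn_student):
    """KL divergence between the last-layer self-attention distributions.
    Assumed shapes: [batch, heads, seq_len, seq_len], already softmax-normalized."""
    return F.kl_div(attn_student.clamp_min(1e-9).log(), attn_teacher,
                    reduction="batchmean")

def value_relation_loss(values_teacher, values_student):
    """KL divergence between the scaled dot-product relations among the
    value vectors of teacher and student.
    Assumed shapes: [batch, heads, seq_len, head_dim]."""
    d_t = values_teacher.size(-1)
    d_s = values_student.size(-1)
    rel_t = torch.softmax(
        values_teacher @ values_teacher.transpose(-1, -2) / d_t ** 0.5, dim=-1)
    rel_s = torch.softmax(
        values_student @ values_student.transpose(-1, -2) / d_s ** 0.5, dim=-1)
    return F.kl_div(rel_s.clamp_min(1e-9).log(), rel_t, reduction="batchmean")

def minilm_distillation_loss(attn_t, attn_s, val_t, val_s):
    """Total distillation signal: both transfer losses added together."""
    return attention_transfer_loss(attn_t, attn_s) + value_relation_loss(val_t, val_s)
```

In MiniLM, these two terms together form the training signal for the student; no labels from the downstream task are needed at this stage.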

Following the original paper, if the teacher and the student differ by more than 50% in the number of layers and the hidden size, it makes sense to introduce a teacher assistant model first and train the student supervised by the teacher assistant.

Length Adaptive Transformer (LAT)

The Length Adaptive Transformer is the second building block for reducing the computational footprint of a language model. It builds on the foundation of the PoWER-BERT model. PoWER-BERT in its original form keeps only the embeddings of the tokens with the highest significance, where the significance is the amount of attention imposed by a word vector on the other word vectors (a minimal sketch of this score follows below). Keep in mind that PoWER-BERT has the same model parameters as BERT; the technique only reduces the inference time. There are even cases where PoWER-BERT is both more accurate and more efficient than BERT, since overly large models tend to overfit. You find a visualization of PoWER-BERT below.

PoWER-BERT is a modified version of the traditional BERT model (yellow boxes: output embedding layer; blue boxes: transformer layer output; red boxes: task-specific layer).
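The significance score itself is straightforward to compute from the attention matrices. The sketch below illustrates the idea described above; it is not the PoWER-BERT implementation, and the tensor names and the simple top-k selection are assumptions for illustration.

```python
import torch

def token_significance(attention_probs: torch.Tensor) -> torch.Tensor:
    """Significance of each token: the total attention imposed by it on the
    other tokens, i.e. the attention all query positions pay to it, summed
    over heads and query positions.
    attention_probs: [batch, heads, seq_len (query), seq_len (key)]."""
    return attention_probs.sum(dim=1).sum(dim=1)   # -> [batch, seq_len]

def keep_top_k(hidden_states, attention_probs, k):
    """Keep only the k most significant token embeddings of a layer."""
    scores = token_significance(attention_probs)           # [batch, seq_len]
    top_idx = scores.topk(k, dim=-1).indices                # [batch, k]
    idx = top_idx.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
    return hidden_states.gather(1, idx)                     # [batch, k, hidden]
```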

The Length Adaptive Transformer extends this idea with LengthDrop and LayerDrop. LengthDrop randomly removes tokens during the training phase. Before each Stochastic Gradient Descent (SGD) update, every layer is assigned a sequence length derived from the previous layer's length and the LengthDrop probability p, which describes the probability of an output token being dropped. It can be compared to dropout in classical neural networks. The image below illustrates what Drop-and-Restore looks like.
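The following is a minimal sketch of how such a per-layer length schedule could be sampled before an SGD update. The uniform reduction between (1 - p) times the previous length and the previous length is a simplifying assumption for illustration, not the exact sampling scheme of the paper.

```python
import random

def sample_length_config(seq_len: int, num_layers: int, p: float = 0.2):
    """Sample one per-layer sequence-length configuration for a forward pass.
    Each layer's length is drawn uniformly between (1 - p) * previous_length
    and previous_length (simplifying assumption)."""
    lengths = []
    current = seq_len
    for _ in range(num_layers):
        current = max(random.randint(int((1 - p) * current), current), 1)
        lengths.append(current)
    return lengths

# example: one sampled configuration for a 12-layer model with 128 input tokens
print(sample_length_config(128, 12, p=0.2))
```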

In addition, there is LayerDrop, which drops complete layers during the training phase. This forces the token representations to be layer-agnostic. Reducing the number of layers would improve the computational requirements, too. More importantly, however, it allows the dropped tokens to be restored in the last layer.

This graphic shows the concept of Drop-and-Restore (yellow boxes: output embedding layer; blue boxes: transformer layer; red boxes: task-specific layer; green boxes: dropped-and-restored output). The layer-agnostic output allows the restoration of intermediate output in the last layer.

However, to make this work in practice, a few more tricks are necessary: the sandwich rule and in-place distillation. The sandwich rule describes the update process. In every training step, the full model is updated with the supervised loss function. Additionally, in every training loop, a set of randomly sampled sub-models (the sandwiches) and the smallest possible sub-model are trained using knowledge distillation from the full model. That means the sub-models are trained to predict the output of the full model with a cross-entropy loss. To be clear: this introduces a training overhead. The Length Adaptive Transformer is only more efficient at inference time.
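A highly simplified sketch of one such training step is shown below. The model interface (the length_config argument and the sample_length_config / smallest_length_config helpers) is hypothetical, and the soft-label cross-entropy is one possible choice for the distillation loss.

```python
import torch
import torch.nn.functional as F

def sandwich_training_step(model, batch, optimizer, num_random_submodels=2):
    """One update following the sandwich rule with in-place distillation.
    model(input_ids, length_config=...) is a hypothetical interface that runs
    the length-adaptive model with a given per-layer length schedule."""
    optimizer.zero_grad()

    # 1) full model (no length reduction), trained with the supervised loss
    full_logits = model(batch["input_ids"], length_config=None)
    supervised_loss = F.cross_entropy(full_logits, batch["labels"])
    supervised_loss.backward()
    teacher_probs = full_logits.detach().softmax(dim=-1)   # frozen soft targets

    # 2) randomly sampled sub-models plus the smallest one, distilled in place
    configs = [model.sample_length_config() for _ in range(num_random_submodels)]
    configs.append(model.smallest_length_config())          # hypothetical helpers
    for cfg in configs:
        sub_logits = model(batch["input_ids"], length_config=cfg)
        # soft-label cross-entropy against the full model's predictions
        kd_loss = F.cross_entropy(sub_logits, teacher_probs)
        kd_loss.backward()

    optimizer.step()
```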

After the training process, there is not only a single model but a whole set of models (all possible sub-models). These sub-models can therefore be analyzed and compared. The paper suggests using an evolutionary search to find the model with the best trade-off between accuracy and the given computational requirements or constraints.
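As an illustration of the idea, the sketch below mutates length configurations and keeps the candidates with the best accuracy-to-FLOPs ratio. The scalarized ranking and the helper functions evaluate_accuracy and count_flops are assumptions for illustration; the actual search in the paper is more elaborate.

```python
import random

def evolutionary_search(initial_config, evaluate_accuracy, count_flops,
                        iterations=50, population_size=16, mutation_prob=0.3):
    """Evolve length configurations and keep those with the best
    accuracy-to-FLOPs ratio (a simple scalarized trade-off)."""

    def mutate(config):
        # shrink some layer lengths; keep the schedule monotonically decreasing
        new, prev = [], None
        for length in config:
            if random.random() < mutation_prob:
                length = random.randint(max(1, length // 2), length)
            if prev is not None:
                length = min(length, prev)
            new.append(length)
            prev = length
        return new

    population = [list(initial_config)]
    for _ in range(iterations):
        population.append(mutate(random.choice(population)))
        population.sort(key=lambda c: evaluate_accuracy(c) / count_flops(c),
                        reverse=True)
        population = population[:population_size]
    return population[0]
```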

Quantization

Quantization is a well-known method that has been in use for a long time. The idea is to reduce the precision of the model weights from, for example, a 32-bit float to an 8-bit representation. This not only reduces memory consumption but also reduces the computational effort.

There are different techniques for quantization; the paper uses zero-point quantization. Zero-point quantization rescales the value distribution into the interval [-127, 127] by normalizing it and shifting the zero point. This affine transformation ensures that all 8 bits are used and therefore as much information as possible is preserved.
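A small NumPy sketch of this affine mapping is shown below. The exact rounding, clipping, and range choices are simplifying assumptions for illustration, not the paper's quantization recipe.

```python
import numpy as np

def zero_point_quantize(x: np.ndarray, num_bits: int = 8):
    """Affine (zero-point) quantization: map the observed value range onto the
    signed 8-bit grid with a scale factor and a zero-point offset."""
    qmin, qmax = -(2 ** (num_bits - 1) - 1), 2 ** (num_bits - 1) - 1   # [-127, 127]
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximation of the original float values."""
    return scale * (q.astype(np.float32) - zero_point)

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = zero_point_quantize(weights)
print(np.abs(weights - dequantize(q, scale, zp)).max())   # small reconstruction error
```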

Summary — QuaLA-MiniLM

QuaLA-MiniLM combines the three building blocks above to speed up a language model drastically. To achieve the best possible compromise between accuracy and computational performance, all steps already focus on the final task (see the image below). Therefore, a task-specific dataset is necessary for this optimization.

All three building blocks of QuaLA-MiniLM are visualized in order. Important to note here is that all blocks work on a task-specific dataset.

This means the increase in efficiency comes with two main drawbacks:

  • It needs more computational power during training.
  • It is a task-specific model, not a generic language model.

However, the results are very promising. In this specific case, QuaLA-MiniLM speeds up the model by a factor of 8.8 compared to the original BERT model, with a drop in accuracy of less than 1 percent. The latency times reported in the table below were measured on an Intel Xeon 8280 system.

With all optimizations, QuaLA-MiniLM achieves an 8.8 times speedup compared to the original BERT version. At the same time, the memory footprint shrinks by roughly a factor of 8.

Are you interested in efficiency optimization for large transformer models? Then have a look at my other articles. And if you have questions or comments, feel free to add them to the comment section. Thanks!
