Benchmarking BERT-based models for text classification.
Recent advances in NLP have given rise to innovative model architectures such as GPT-3 and BERT. These pre-trained models have democratized machine learning, allowing even people with little data science experience to build ML applications without training a model from scratch or using enormous computational resources.
In this post, we focus on the most popular BERT models for text classification, which demonstrate outstanding performance. Large-scale transformer-based language models such as GPT-3, which has 175 billion parameters and is 470 times larger than BERT-Large, are not considered in this article.
We explore and compare the most popular architectures for multiclass text classification on an emotion-classification dataset, which can be downloaded here: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp. The models were trained with the same hyperparameters:
batch_size = 8
gradient_accumulation_steps = 4
max_grad_norm = 5
warmup_proportion = 0.1
learning_rate = 5e-5
All models (except the RASA DIETClassifier) were trained with early stopping and a patience parameter of 5: if the loss did not improve over 5 consecutive epochs, training was stopped, and the model with the minimum loss on the test data was selected for the final quality assessment. The maximum number of training epochs was set to 20.
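The early-stopping rule above can be sketched as a plain training loop. `train_one_epoch` and `eval_loss` are hypothetical stand-ins for a real training and evaluation step; only the patience logic is the point here:

```python
# Minimal sketch of the early-stopping rule described above
# (patience = 5, at most 20 epochs).
def train_with_early_stopping(train_one_epoch, eval_loss,
                              patience=5, max_epochs=20):
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)          # hypothetical training step
        loss = eval_loss(epoch)         # hypothetical test-loss evaluation
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                   # no improvement for `patience` epochs
    # the checkpoint with the minimum test loss is kept
    return best_epoch, best_loss
```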
RASA DIETClassifier — a transformer-based model that performs both as an entity extractor and intent classifier.
For the RASA DIETClassifier, we used a prepared RASA config with the following parameters:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 20
  constrain_similarities: true
It is important to highlight that the RASA DIETClassifier was trained on the CPU, whereas the rest of the models were trained on a GPU.
mBERT — multilingual BERT transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output). For our benchmark, we exploit the bert-base-multilingual-uncased model from the transformers library.
https://gist.github.com/kedisho/12da3858eec023cfeb4a4c29bd53b377
DistilBERT — DistilBERT transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output). For our benchmark, we exploit the distilbert-base-cased model from the transformers library.
https://gist.github.com/kedisho/dc2e8ace3af3e49b4b2c4b415adfa4f3.js
XLM-RoBERTa — RoBERTa-based transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output). For our benchmark, we exploit the xlm-roberta-base model from the transformers library, tokenized with XLMRobertaTokenizerFast.
https://gist.github.com/kedisho/e1619bca46f932341ee84684bbbbd6f4.js
LaBSE — sentence-transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output). For our benchmark, we exploit the sentence-transformers/LaBSE model from the transformers library.
https://gist.github.com/kedisho/f7cbc60d21ccadda2d59bee3c9853d52.js
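All four checkpoints can be loaded the same way. A hedged sketch, assuming transformers is installed and the weights are downloadable; the helper name and the `num_labels=6` default (the emotion dataset has six classes) are ours, not from the gists:

```python
def load_classifier(checkpoint, num_labels=6):
    """Load a tokenizer and a sequence-classification head for `checkpoint`.

    Illustrative helper: the Auto* classes resolve the right architecture
    (BERT, DistilBERT, XLM-RoBERTa, ...) from the checkpoint name.
    """
    # imported lazily so the sketch itself has no hard dependency
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels)
    return tokenizer, model

# e.g. load_classifier("bert-base-multilingual-uncased")
#      load_classifier("distilbert-base-cased")
#      load_classifier("xlm-roberta-base")
#      load_classifier("sentence-transformers/LaBSE")
```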
While building the benchmark, we decided to use a set of metrics that evaluate both quality and resource requirements during training and inference.
To evaluate classification quality, we built a confusion matrix for each model and computed the F1-score, precision, and recall.
https://gist.github.com/kedisho/48e952222a4a6a3506a783636e7dd746.js
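To make the quality metrics concrete, here is a self-contained sketch that spells out the definitions of the confusion matrix and the per-class precision/recall/F1; in practice `sklearn.metrics` provides the same functionality:

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """counts[(t, p)] = how often true label t was predicted as p."""
    return Counter(zip(y_true, y_pred))

def precision_recall_f1(y_true, y_pred, label):
    """Per-class metrics for one label, from first principles."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```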
Training stage. Performance metrics:
Training duration, 1 epoch — the duration of one training epoch, in seconds, on the GPU (on the CPU for the RASA DIETClassifier).
Training duration (early stopping) — the total duration of training up to the early-stopping criterion, in seconds.
Peak RAM usage — peak memory consumption during training, measured with memory_profiler at 0.1-second intervals, in MiB.
Mean RAM usage — average memory consumption during training, measured with memory_profiler at 0.1-second intervals, in MiB.
GPU RAM usage — the amount of allocated GPU memory, determined with gpustat --watch.
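The post samples RAM with memory_profiler every 0.1 s; as a self-contained stand-in, the same idea can be sketched with the stdlib tracemalloc module, which tracks current and peak Python heap allocations for a call (the helper name is ours):

```python
import tracemalloc

def measure_peak_memory(fn, *args, **kwargs):
    """Run fn and return (result, peak Python heap allocated during the call, in MiB).

    Note: tracemalloc sees only Python-level allocations, so this is a
    rough analogue of memory_profiler's process-wide RSS sampling.
    """
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20  # MiB, matching the units above
```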
Inference stage. Performance metrics:
One-example latency — the time it takes to get an inference from the model, from the moment the raw data is passed to the pipeline, measured with the %%prun tool in jupyter-notebook on the same “I am fine” text for all models.
Peak RAM usage — peak memory consumption during inference, measured with memory_profiler at 0.1-second intervals, in MiB.
Mean RAM usage — average memory consumption during inference, measured with memory_profiler at 0.1-second intervals, in MiB.
CPU usage — the number of function calls executed on the CPU, measured with jupyter-notebook’s %%prun tool.
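A simple wall-clock complement to the %%prun measurement above: time a single end-to-end call with `time.perf_counter`. `pipeline` here stands for any callable that takes raw text, e.g. a transformers text-classification pipeline; the helper is illustrative:

```python
import time

def one_example_latency(pipeline, text="I am fine"):
    """Return (prediction, latency in seconds) for a single raw-text input."""
    start = time.perf_counter()
    prediction = pipeline(text)        # hypothetical classification pipeline
    return prediction, time.perf_counter() - start
```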
As a result, we obtained the final benchmark for the 5 most popular BERT models:
Based on the suggested metrics, we can draw the following conclusions:
- For cases where a quick model response is required (for example, chatbots), the best option is the RASA DIETClassifier, but you need to take into account the lower quality of the trained model. An undeniable advantage is that it can be trained relatively quickly on a CPU with limited resources.
- The DistilBERT classifier is slightly slower than the RASA DIETClassifier, but it makes significantly fewer errors.
- For services that don’t require a quick response time, the mBERT and XLM-RoBERTa classifiers are the most recommended candidates, given their performance characteristics and memory consumption requirements.
- The LaBSE classifier is the most accurate model considered in our research, but it requires a significant amount of memory to operate. For some experiments and production deployments, this trade-off can be acceptable.