Benchmarking BERT-based models for text classification.

Recent achievements in NLP have given rise to innovative model architectures such as GPT-3 and BERT. These pre-trained models have democratized machine learning: even people with little data science experience can now build ML applications without training a model from scratch or using enormous computing resources.

In this post, we focus on the most popular BERT models for text classification, which demonstrate outstanding performance. Large-scale transformer-based language models such as GPT-3, which has 175 billion parameters and is roughly 470 times larger than BERT-Large, are not considered in this article.

We explore and compare the most popular architectures for multiclass text classification on an emotion classification dataset. The dataset can be downloaded here: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp. The models were trained with the same hyperparameters:

batch_size = 8
gradient_accumulation_steps = 4
max_grad_norm = 5
warmup_proportion = 0.1
learning_rate = 5e-5

All models (except the RASA DIETClassifier) were trained with early stopping and a patience of 5: if the loss did not improve for 5 consecutive epochs, training was stopped and the model with the minimum value of the loss function on the test data was selected for the final quality assessment. The maximum number of training epochs was set to 20.
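Below is a minimal sketch of how this training setup could look with the Hugging Face Trainer API. The actual training code lives in the gists linked later and may use a different loop; the file names (train.txt / val.txt with semicolon-separated "text;emotion" pairs, as in the Kaggle dataset) and the checkpoint name are assumptions.

import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer, EarlyStoppingCallback)

# Assumed Kaggle layout: one "text;emotion" pair per line.
train_df = pd.read_csv("train.txt", sep=";", names=["text", "label"])
val_df = pd.read_csv("val.txt", sep=";", names=["text", "label"])
labels = sorted(train_df["label"].unique())
label2id = {label: i for i, label in enumerate(labels)}
for df in (train_df, val_df):
    df["label"] = df["label"].map(label2id)

checkpoint = "bert-base-multilingual-uncased"  # any of the checkpoints below can be swapped in
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=len(labels))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
val_ds = Dataset.from_pandas(val_df).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    max_grad_norm=5,
    warmup_ratio=0.1,              # warmup_proportion = 0.1
    learning_rate=5e-5,
    num_train_epochs=20,           # upper bound; early stopping usually ends training sooner
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,   # keep the checkpoint with the lowest validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()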

RASA DIETClassifier — a transformer-based model that acts as both an entity extractor and an intent classifier.

For the RASA DIETClassifier, we used a prepared RASA config with the following pipeline:

- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 20
  constrain_similarities: true

It is important to highlight that the RASA DIETClassifier was trained on the CPU, whereas the rest of the models were trained on a GPU.

mBERT — a multilingual BERT transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output). For our benchmark, we use the bert-base-multilingual-uncased model from the transformers library.

https://gist.github.com/kedisho/12da3858eec023cfeb4a4c29bd53b377
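The gist above contains the full fine-tuning code. As a complementary sketch, loading this checkpoint with its classification head (a linear layer over the pooled output) and running one forward pass could look as follows; num_labels=6 is an assumption matching the six emotion classes in this dataset.

import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-uncased", num_labels=6)
print(model.classifier)           # Linear(768 -> 6): the head on top of the pooled output

enc = tokenizer("I am fine", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # shape (1, 6); the head is random until fine-tuned
print(logits.argmax(dim=-1).item())

The same pattern applies to the checkpoints below (distilbert-base-cased, xlm-roberta-base, sentence-transformers/LaBSE) by swapping the model name and the corresponding Auto or model-specific classes.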

DistilBERT — a DistilBERT transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output). For our benchmark, we use the distilbert-base-cased model from the transformers library.

https://gist.github.com/kedisho/dc2e8ace3af3e49b4b2c4b415adfa4f3.js

XLM-RoBERTa — an XLM-RoBERTa transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output). For our benchmark, we use the xlm-roberta-base model from the transformers library (with XLMRobertaTokenizerFast for tokenization).

https://gist.github.com/kedisho/e1619bca46f932341ee84684bbbbd6f4.js

LaBSE — a LaBSE sentence-transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output). For our benchmark, we use the sentence-transformers/LaBSE model from the transformers library.

https://gist.github.com/kedisho/f7cbc60d21ccadda2d59bee3c9853d52.js

While building the benchmark, we decided to use a set of metrics that evaluate both classification quality and resource requirements during training and testing of the models.

To evaluate classification quality, we built a confusion matrix and computed the F1-score, precision, and recall for each model.

https://gist.github.com/kedisho/48e952222a4a6a3506a783636e7dd746.js
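The gist above contains the evaluation code used in the post. A minimal equivalent with scikit-learn, reusing the trainer, tokenize helper, labels, and label2id from the training sketch earlier and assuming the Kaggle test.txt split, might look like this (weighted averaging is shown; the post does not state which averaging was used):

import pandas as pd
from datasets import Dataset
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support, classification_report

# Load and tokenize the held-out split (same "text;emotion" format as train.txt).
test_df = pd.read_csv("test.txt", sep=";", names=["text", "label"])
test_df["label"] = test_df["label"].map(label2id)
test_ds = Dataset.from_pandas(test_df).map(tokenize, batched=True)

# Predict and compare against the gold labels.
pred = trainer.predict(test_ds)
y_pred = pred.predictions.argmax(axis=-1)
y_true = pred.label_ids

print(confusion_matrix(y_true, y_pred))
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
print(f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
print(classification_report(y_true, y_pred, target_names=labels))  # per-class breakdown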

Training stage. Performance metrics:

Training duration, 1 epoch — the duration of training for one epoch, in seconds, on the GPU (on the CPU for the RASA DIETClassifier).

Training duration (early stopping) — the total training duration, taking the early stopping criterion into account.

RAM usage, peak — peak memory consumption during training, measured with memory_profiler at 0.1-second intervals, in MiB.

RAM usage, mean — average memory consumption during training, measured with memory_profiler at 0.1-second intervals, in MiB (a measurement sketch follows this list).

GPU RAM usage — the amount of GPU memory allocated, monitored with gpustat --watch.
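As a sketch of how the peak and mean RAM figures can be collected, memory_usage from memory_profiler can sample a wrapped training run every 0.1 s; run_training here is a hypothetical wrapper around trainer.train() from the earlier sketch.

from memory_profiler import memory_usage

def run_training():
    # Hypothetical wrapper around the training run being profiled.
    trainer.train()

# Sample resident memory every 0.1 s while the function runs; values are in MiB.
samples = memory_usage((run_training, (), {}), interval=0.1)
print(f"RAM peak: {max(samples):.1f} MiB, RAM mean: {sum(samples) / len(samples):.1f} MiB")

GPU memory, by contrast, was simply read off gpustat --watch while training was running.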

Inference stage. Performance metrics:

One-example latency — the time it takes to get a prediction from the model from the moment the raw text is passed to the pipeline, measured with the %%prun tool in a Jupyter notebook, using the same “I am fine” input for all models.

RAM usage, peak — peak memory consumption during inference, measured with memory_profiler at 0.1-second intervals, in MiB.

RAM usage, mean — average memory consumption during inference, measured with memory_profiler at 0.1-second intervals, in MiB.

CPU usage — the number of function calls on the CPU, measured with Jupyter's %%prun tool (see the sketch below).
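For the latency and CPU-usage figures, a Jupyter cell along these lines can be used, reusing the fine-tuned tokenizer and model (and torch import) from the sketches above; %%prun profiles the cell and reports the number of function calls and CPU time. The original measurements wrap the full pipeline, so this is only a sketch.

%%prun
# Single-example inference on the same input used for all models.
enc = tokenizer("I am fine", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
print(logits.argmax(dim=-1).item())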

As a result, we got the final benchmark for the 5 most popular BERT-based models:

Based on the suggested metrics, we can make the following conclusions:
