Transformers: more than meets the eye!

Transformer networks are among the newest and fastest sequence models in use. Let us dive in a bit and learn why they are faster than other sequence models and why they are so commonly used.

The following information comes from the MIT 6.S191 class and a YouTube video named "Illustrated Guide to Transformers Neural Network: A step by step explanation"; the example sequence below is also taken from that same video.

Please have fun :)

TRANSFORMER NETWORKS BASICS:

Transformer networks were first introduced in 2017 in the paper "Attention Is All You Need": Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Although the paper focuses on natural language tasks, the model can also be applied to other types of data, such as images. For example, DALL·E by OpenAI creates images from text descriptions.

The prior models used for sequence processing were RNNs and/or LSTMs. The problems with these types of networks are:

  • the vanishing gradient problem of recurrent networks
  • long training times, since the recurrent loop takes one input at a time and does not allow parallelization; GPUs designed for AI are built for massively parallel work and are better suited to feed-forward networks than to recurrent ones.

Hence, training RNN-type networks for sequence processing takes longer and gives poorer results because of the vanishing gradient problem: as the input gets longer, the output accuracy gets worse.

The paper takes inspiration from how humans process sequences. As humans, just by looking at a picture we can identify which object in the picture is the most important. How does the human brain know which parts of a picture to attend to? How does it know which features deserve high attention?

The goal is to identify and attend to the most important features in the input. In order to do that:

  • We need to preserve the order (position) information without processing the words in a sentence one by one.
  • We extract query, key and value information from the input data and build a self-attention mechanism; that means we operate on the input itself.
  • We compute the attention weighting by focusing on the similarity between the query and the key.

The following is an image of the model structure. If you are learning neural networks you are probably already familiar with most of the modules inside this model (e.g. the linear module, the feed-forward network, softmax, etc.). The new modules are likely the positional encoding and the multi-headed attention blocks.

Transformer network model structure

Let's dig deeper and analyse the encoder part.

ENCODER

This module generates continuous vector representations holding all the information about the input sequence. The decoder then takes this output and generates one word of output at a time. Each newly generated word is appended to the decoder's input so that future predictions stay consistent with what has already been produced.

Let us try to create a chatbot: the input to the network will be "hi, how are you", and we expect it to answer with "I am fine".

INPUT EMBEDDING MODULE

Let's check the input embedding layer:

Neural networks learn through numbers, so each word maps to a vector of numbers.

Embedding vectors for the words in the input sequence

For a given sequence we have one vector per word (assuming the network works on natural language sequences).
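As a minimal sketch of this lookup (with a made-up toy vocabulary and a small embedding size, not the dimensions used in the paper), the mapping from words to vectors can be pictured like this:

```python
import numpy as np

# Hypothetical toy vocabulary for the chatbot example
vocab = {"<start>": 0, "<end>": 1, "hi": 2, ",": 3, "how": 4,
         "are": 5, "you": 6, "I": 7, "am": 8, "fine": 9}
d_model = 8  # embedding size (512 in the original paper)

# Embedding table: one learned row per word (random numbers here, learned in the real model)
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

sentence = ["hi", ",", "how", "are", "you"]
token_ids = [vocab[w] for w in sentence]
word_embeddings = embedding_table[token_ids]   # shape: (5, d_model)
```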

POSITIONAL ENCODING

For each word in the input we generate a word embedding vector. These embedding vectors are then added to a position vector for each word. The result is a new vector, the "positional input embedding". Sine and cosine functions are used to create the position vectors. This position information is crucial because the transformer is not a recurrent network and therefore needs to be told where each word sits in the sequence.

Computing positional word embedding vectors
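A small sketch of the sine/cosine formulation from the original paper, continuing the NumPy example above and reusing `word_embeddings`:

```python
def positional_encoding(seq_len, d_model):
    """Sine/cosine positional encoding as described in "Attention Is All You Need"."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # even embedding dimensions
    angles = positions / np.power(10000, dims / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices get sine
    pe[:, 1::2] = np.cos(angles)   # odd indices get cosine
    return pe

# Add the position vectors to the word embedding vectors
positional_embeddings = word_embeddings + positional_encoding(len(sentence), d_model)
```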

After the embedding vectors are summed with the positional vectors, these outputs (the positional embeddings) go through the encoder. The encoder consists of two modules: a multi-headed attention module and a fully connected feed-forward network, each followed by a normalization layer. The multi-headed attention module is what really sets the transformer apart. There are also residual connections around each module.

MULTI-HEADED ATTENTION MODULE

The following image shows what is inside the multi-headed attention module. It contains a self-attention mechanism: the module relates every word in the sequence to every other word, which is what "self-attention" means. To make this happen, we feed the positional embeddings through three linear fully connected layers to create the "Query", "Key" and "Value" vectors. For a brief intuition of what query, key and value mean, there is a nice example on Stack Exchange, and the MIT lecturers give the very same example in the course: when you search for videos on YouTube, the search engine maps your query (the text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in its database, then presents you the best-matched videos (the values).

From <https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms>

Multi-headed attention module in detail

But what does a linear layer actually do? It takes the very same positional embedding vectors and multiplies them by a learned weight matrix. The three linear layers produce the Query, Key and Value matrices that the attention computation then operates on.

Linear layer operation
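A sketch of those projections, continuing the snippet above, with randomly initialized weight matrices standing in for the learned ones:

```python
d_k = d_model  # dimension of queries/keys/values (kept equal to d_model here for simplicity)

# Learned projection matrices (random here, learned during training in the real model)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = positional_embeddings @ W_q   # queries, shape (seq_len, d_k)
K = positional_embeddings @ W_k   # keys,    shape (seq_len, d_k)
V = positional_embeddings @ W_v   # values,  shape (seq_len, d_k)
```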

To calculate the similarities between the Query and Key vectors we take the dot product of the two, and a scaling is then applied. This gives a similarity measure between the queries and the keys.

The "Queries" and "Keys" are multiplied, and the result of this multiplication is the "scores matrix". Below is an example of a scores matrix.

Scores matrix before scaling

This matrix tells us how much each word is related to every other word in the input sequence: the bigger the value, the bigger the attention. To keep the values from growing too large, we scale the scores matrix by the square root of the dimension of the queries and keys. The scaled scores matrix is then put through the softmax function to obtain the "attention weights" matrix. The attention weights are finally multiplied with the value vectors, which gives the output vectors.

In brief, multi-headed attention computes the attention weights for the input and produces an output vector that encodes how each word in the sequence should attend to the other words.
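Putting the last few steps together, here is a compact sketch of scaled dot-product attention, continuing with the Q, K and V computed above:

```python
def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # similarity of every query with every key
    attention_weights = softmax(scores)       # each row sums to 1
    return attention_weights @ V              # weighted sum of the value vectors

attention_output = scaled_dot_product_attention(Q, K, V)
```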

The output of the multi-headed attention module is then added back to the original input matrix; this is called a residual connection. After the addition, the result goes through a normalization layer and then through a feed-forward network. This feed-forward network consists of a couple of linear layers with a ReLU activation between them. The output of the feed-forward network is in turn added to its normalized input, and after one more normalization we get the output of the encoder block.

Normalization Layer and the feed forward network of the encoder
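A rough sketch of that residual-plus-normalization flow, continuing the snippets above (a simplified layer norm without learned scale/shift, and no dropout, unlike the real model):

```python
def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

# Feed-forward network: two linear layers with a ReLU in between (random weights for illustration)
d_ff = 32
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU between the two linear layers

# Residual connection + normalization around each sub-module
x = layer_norm(positional_embeddings + attention_output)  # attention sub-layer
encoder_output = layer_norm(x + feed_forward(x))          # feed-forward sub-layer
```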

The encoder module can also be stacked several times, allowing each layer to learn a different kind of attention.

DECODER

Let us learn the basics of the decoder part of the model.

OUTPUT EMBEDDING & POSITIONAL ENCODING

The reply to the input question, which is called the output here, is passed through an output embedding and positional encoding, just as in the encoder. The positional information is then passed to the multi-headed attention layer. If the network has not predicted any words yet, the input to this decoder block is just the <start> token. The words "I am fine" will then be predicted one word at a time.

Output embedding and positional encoding.

The decoder block has two different multi-headed attention blocks, very similar to the one in the encoder. Let us look at them in detail.

MULTI HEADED ATTENTION 1

The output sequence goes through the embedding and positional encoding layer, and the result goes to the first multi-headed attention module. Like the multi-headed attention layer in the encoder, this first module again computes a scores matrix from the query and key matrices, calculated here from the "<start>" word. This output will become an input for the second multi-headed attention layer. Decoders of this type are auto-regressive, so they predict the output one word at a time. For example, when we are computing scores for the word <start> we should not have access to the words "I am fine". Similarly, when computing attention scores for the word "am", we should not have access to "fine", because "fine" is a future word that is generated after "am". The word "am" should only have access to itself and the words before it. The same is true for all the other words: they can only attend to previous words. Below, the pink cells must not contribute to the attention scores matrix.

Score matrix for the predicted words.

Hence, a masking mechanism is used to prevent the decoder from looking at future words. A mask is applied to the scores matrix so that the probabilities of future words become 0. The mask is added after scaling the scores and before computing the softmax. The mask matrix has the same size as the scores matrix and is filled with 0 values for the current and earlier positions and negative infinity for the future positions. Once this matrix is added to the scores matrix, the scores of all future words become negative infinity.

Masking the score matrix

The masked scores then go through a softmax function. After taking the softmax of the masked scores matrix, the negative-infinity entries become 0, i.e. they get 0 probability. Here you can see that the attention scores for the word "am" have values for itself and the prior words, and 0 probability for future words like "fine". This method makes the model put no attention on future words.

Probability matrix of the masked scores matrix
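A tiny illustration of the causal mask on a made-up scores matrix for the sequence <start> I am fine, reusing the softmax helper from the encoder sketch:

```python
seq_len = 4                                            # <start>, I, am, fine
scores = rng.normal(size=(seq_len, seq_len))           # stand-in for Q @ K.T / sqrt(d_k)

# Upper-triangular mask: 0 for current/past positions, -inf for future positions
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

masked_scores = scores + mask
attention_weights = softmax(masked_scores)             # future positions get exactly 0 probability
print(np.round(attention_weights, 2))                  # rows are lower-triangular
```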

This optional mask layer is what distinguishes the first multi-headed attention layer from the other multi-headed attention layers in the model. The layer has multiple heads, and the mask is applied to each head's matrices before the heads are concatenated.

Multiple layers in the first multi headed attention module
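A rough sketch of how multiple heads can be realized, continuing the NumPy example: the model dimension is split across the heads, attention runs per head, and the results are concatenated (the final output projection of the real model is omitted here for brevity):

```python
def multi_head_attention(X, num_heads, mask=None):
    """Toy multi-head self-attention: split d_model across heads, attend, concatenate."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Per-head projection matrices (random here; learned in the real model)
        Wq = rng.normal(size=(d_model, d_head))
        Wk = rng.normal(size=(d_model, d_head))
        Wv = rng.normal(size=(d_model, d_head))
        q, k, v = X @ Wq, X @ Wk, X @ Wv
        scores = q @ k.T / np.sqrt(d_head)
        if mask is not None:
            scores = scores + mask                  # e.g. the causal mask from above
        heads.append(softmax(scores) @ v)
    return np.concatenate(heads, axis=-1)           # back to shape (seq_len, d_model)

out = multi_head_attention(positional_embeddings, num_heads=2)
```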

MULTI HEADED ATTENTION 2

The outputs of the encoder module are inputs to the second multi-headed attention layer: the keys and values of this module come from the encoder output, while the queries come from the first decoder attention block. Feeding in the encoder outputs here lets the decoder decide which parts of the input to focus on. After this layer, the output goes through a feed-forward layer.

The output of the feed-forward network goes through a linear layer that acts as a classifier. This linear layer has one output for every word in the vocabulary; in other words, every possible word is represented as a number, so if the vocabulary contains, say, 10,000 words, the layer has 10,000 outputs. That output then goes through a softmax layer of the same size, which turns the scores into probabilities. The word with the biggest probability is the model's next prediction; in our case, after <start> it should predict the word "I". The decoder appends this prediction to its input list and continues predicting until the <end> token is produced.
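A minimal sketch of that final classification step, using the toy vocab defined earlier and a random projection in place of the learned one:

```python
vocab_size = len(vocab)
W_out = rng.normal(size=(d_model, vocab_size))   # learned classifier weights in the real model

decoder_output = rng.normal(size=(1, d_model))   # stand-in for the decoder's last-position vector
logits = decoder_output @ W_out                  # one score per word in the vocabulary
probs = softmax(logits)                          # probabilities over the whole vocabulary

id_to_word = {i: w for w, i in vocab.items()}
next_word = id_to_word[int(np.argmax(probs))]    # e.g. "I" after <start>, once trained
```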

In the following image the token <end> has the highest probability, so it becomes the output of the decoder. Because the decoder loops until it produces the <end> token, our chatbot generates its response word by word, one word at a time.

linear classifier and the softmax layer

The transformer decoder can also be stacked, which lets the model learn to extract and focus on different combinations of attention from its attention heads, making its predictions more accurate.

Thanks for reading :)
