Transformers…Attention is all you need!
TL;DR - Transformers are becoming more and more important, not just in NLP but in other areas of deep learning beyond language. Google has rolled out BERT and other transformer-based models to Google Search, using them to power search results, and calls this one of the biggest leaps forward in the history of search.
In this notebook, we’ll focus on one paper that started it all, Attention is all you need!
Below is the architecture of the model as mentioned in the paper.
On a high level, like any RNN-based or CNN-based sequence-to-sequence architecture, the Transformer is composed of an encoder and a decoder. The encoder converts the original input sequence into a latent representation in the form of hidden state vectors. The decoder then tries to predict the output sequence using this latent representation.
The input (source) and output (target) sequence embeddings are added to positional encodings before being fed into the encoder and the decoder. The encoding component is a stack of identical encoder layers and the decoding component is a stack of identical decoder layers of the same depth; the paper used 6 layers in each stack, but we will use only 3 encoder and 3 decoder layers.
Each encoder consists of two sub-layers:
- a self-attention layer — a layer that helps the encoder look at other words in the input sentence as it encodes a specific word.
- a feed-forward neural network: the exact same feed-forward network is applied independently to the word at each position through its own path in the encoder, hence it is called a position-wise feed-forward network.
The decoder also has self-attention and feed-forward layers, but between them sits another attention layer that helps the decoder focus on the relevant parts of the input sentence.
This is just a high-level view of the model; we will now dive deeper into each component in detail.
Encoder
Let’s now have a closer look at the Encoder structure.
Once the data is prepared, these inputs are passed on to the next layer, the embedding layer. The embedding layer holds an index for every word in the vocabulary, and against each of those indices a vector is attached; initially these vectors are filled with random numbers. Later, during the training phase, the model updates them with values that better help it with the assigned task.
The Transformer paper went with an embedding size of 512 and we will use the same here.
So what are word embeddings?
Well, these are just vector representations of a given word. Each dimension of a word embedding tries to capture some concept or linguistic feature of that word; these could be things like whether the word is a verb, a preposition, an entity, or something else. In reality, since the model decides these features itself during training, it can be difficult to find out exactly what information each of these dimensions represents.
Graphically, the values of these dimensions represent the coordinates of the given word in some hyperspace. If two words share similar linguistic features and appear in similar contexts their embedding values are updated to become closer and closer during the training process.
For example, consider the two words 'King' and 'Queen'. Initially their embeddings are randomly initialized, but over the course of training they may become more and more similar, since these two words often appear in similar contexts. Compare this with the word 'School', which usually appears in a very different context. The embedding layer takes the input indices, converts them into word embeddings, and passes them on to the next step, the positional embeddings.
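As a small illustration, here is roughly what that lookup looks like in PyTorch (the vocabulary size and token indices below are made up for the example):

```python
import torch
import torch.nn as nn

# hypothetical sizes for illustration
vocab_size, hid_dim = 10_000, 512

# one trainable 512-dim vector per vocabulary index, randomly initialised
tok_embedding = nn.Embedding(vocab_size, hid_dim)

# a batch of 1 sentence with 4 token indices
src = torch.tensor([[5, 42, 7, 1]])   # [batch size, src len]
embedded = tok_embedding(src)         # [batch size, src len, hid dim]
print(embedded.shape)                 # torch.Size([1, 4, 512])
```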
Positional embeddings, why do we need them?
If an RNN were to take up these embeddings, it would do so sequentially, one embedding at a time, which is why RNNs are so slow. There is a positive side to this, however: since RNNs consume the embeddings sequentially in the designated order, they know which word came first, which came second, and so on. Transformers, on the other hand, take up all embeddings at once. Even though this is a huge plus and makes Transformers much faster, the downside is that they lose the critical information about word order. In simple words, they are not aware of which word came first in the sequence and which came last. Here is why positional information matters:
Even though she did not win the award, she was satisfied.
Even though she did win the award, she was not satisfied.
Notice how the position of the single word 'not' changed not only the sentiment but also the meaning of the sentence. So what do we do to bring the order information back to the Transformer without making it recurrent like an RNN? How about we introduce a new set of vectors containing the position information? Let us call them position embeddings.
We can start by simply adding the word embeddings to their corresponding position embeddings to create new, order-aware word embeddings. But what values should our position embeddings contain? We can start by literally filling in the word position numbers, so the first position embedding holds zeros, the next holds all ones, and so on.
Before being summed with the position embeddings, the token embeddings are multiplied by a scaling factor, the square root of the hidden dimension. This helps reduce the variance of the embeddings. Dropout is then applied to the combined embeddings.
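A minimal sketch of this step, assuming PyTorch and learned positional embeddings indexed by token position (the names tok_embedding, pos_embedding and max_length are illustrative, not taken from the paper):

```python
import math
import torch
import torch.nn as nn

hid_dim, max_length, dropout_p = 512, 100, 0.1
tok_embedding = nn.Embedding(10_000, hid_dim)      # word embeddings
pos_embedding = nn.Embedding(max_length, hid_dim)  # one vector per position
dropout = nn.Dropout(dropout_p)
scale = math.sqrt(hid_dim)                         # scaling factor = sqrt(hidden dim)

src = torch.tensor([[5, 42, 7, 1]])                # [batch, src len]
pos = torch.arange(src.shape[1]).unsqueeze(0).repeat(src.shape[0], 1)  # [[0, 1, 2, 3]]

# scale token embeddings, add position embeddings, then apply dropout
x = dropout(tok_embedding(src) * scale + pos_embedding(pos))  # [batch, src len, hid dim]
```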
The combined embeddings are then passed on to the encoder layers along with the src_mask. The src_mask (source mask) has the same shape as the source sentence but holds a value of 1 where the token in the source sentence is not a <pad> token and 0 where it is a <pad> token. It is used in the encoder layers to mask the multi-head attention mechanisms, which calculate and apply attention over the source sentence, so that the model does not pay attention to <pad> tokens, as they do not contain any useful information.
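Here is one way such a mask could be built, assuming a hypothetical src_pad_idx for the <pad> token:

```python
import torch

src_pad_idx = 0                      # hypothetical index of the <pad> token

def make_src_mask(src):
    # src: [batch size, src len]
    # True where the token is real, False where it is padding;
    # the extra dims let the mask broadcast over heads and query positions
    return (src != src_pad_idx).unsqueeze(1).unsqueeze(2)   # [batch, 1, 1, src len]

src = torch.tensor([[5, 42, 7, 0]])  # last token is <pad>
print(make_src_mask(src))            # tensor([[[[ True,  True,  True, False]]]])
```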
Encoder Layer
At a high level, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.
Here:
- we pass the source sentence and its mask into the multi-head attention layer
- the output of this is passed to an “Add and Norm” block (apply dropout to the output of the multi-head attention layer, add a residual connection, and pass it through a layer normalization layer)
- we then pass it through a position-wise feed-forward layer
- the output of this is again passed to another “Add and Norm” block (again dropout, a residual connection, and layer normalization); see the sketch after this list
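To make the flow concrete, here is a rough sketch of one encoder layer. It leans on PyTorch's built-in nn.MultiheadAttention and a small nn.Sequential feed-forward block as stand-ins for the custom modules described later, so treat it as an illustration of the wiring rather than the exact implementation:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Sketch of one encoder layer: self-attention -> Add & Norm -> FFN -> Add & Norm."""
    def __init__(self, hid_dim=512, n_heads=8, pf_dim=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hid_dim, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hid_dim, pf_dim), nn.ReLU(),
                                 nn.Linear(pf_dim, hid_dim))
        self.attn_norm = nn.LayerNorm(hid_dim)
        self.ffn_norm = nn.LayerNorm(hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_key_padding_mask=None):
        # note: nn.MultiheadAttention expects True at positions to ignore
        # (i.e. at <pad> tokens), the inverse of the src_mask convention above
        attn_out, _ = self.self_attn(src, src, src,
                                     key_padding_mask=src_key_padding_mask)
        # "Add and Norm": dropout, residual connection, layer normalization
        src = self.attn_norm(src + self.dropout(attn_out))
        # position-wise feed-forward, then another "Add and Norm"
        src = self.ffn_norm(src + self.dropout(self.ffn(src)))
        return src
```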
“Add and Norm” plays a key role within the Transformer block: it is used to connect the inputs and outputs of the other layers smoothly. We add a layer that contains a residual connection followed by layer normalization after both the multi-head attention layer and the position-wise FFN. Layer normalization can be thought of as similar to batch normalization. One difference is that the mean and variance for layer normalization are calculated along the last dimension (axis=-1) instead of the first, batch dimension (axis=0).
Layer normalization prevents the range of values in the layers from changing too much, which allows faster training and better generalization ability.
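A quick way to see the difference is that nn.LayerNorm normalizes each token vector over its feature (last) dimension, rather than over the batch:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 4, 512)          # [batch, seq len, hid dim]

layer_norm = nn.LayerNorm(512)      # mean/variance taken over the last dim (hid dim)
out = layer_norm(x)

# each individual token vector now has (approximately) zero mean and unit variance
print(out[0, 0].mean().item(), out[0, 0].std().item())
```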
Multi-Head Attention layer
Let us dive further and understand the components of the Multi-Head Attention layer.
The attention mechanism helps the model focus on the important words in a given input sentence. Transformers do not use simple attention; they use something called self-attention. Consider this sentence: “He went to the bank to deposit some money, after which he went to a river bank for a walk.” Note how the same word 'bank' means two different things: the first occurrence of 'bank' refers to a financial institution, while the second refers to the side of a river. So how can a model know which 'bank' refers to what? We humans judge the meaning of a word by paying attention to the context in which it appears. For instance, 'deposit' and 'money' indicate that the first occurrence of 'bank' refers to a financial institution, while the word 'river' indicates that the second occurrence means a river bank. Likewise, the meaning of every word can be regarded as the sum of the words it pays the most attention to.
Now, the difference between simple attention and self-attention is that simple attention selectively focuses on words with respect to some external query: the more important a word is in determining the answer to that query, the more focus it gets. Self-attention, on the other hand, also takes the relationships among words within the same sentence into account, and this is the layer where the attention computation happens.
With that distinction in mind, let us look at the components of the Multi-Head Attention layer one by one.
The first component in this block is a set of three linear layers. A linear layer is simply composed of a bunch of fully connected neurons without an activation function. They serve two main purposes:
- mapping inputs onto the outputs
- changing the (matrix/vector) dimension of the inputs themselves.
Say we take embeddings of size 512, pass them to a linear layer, and shrink the size to 256. One of many reasons why we might want to change or shrink the dimension of an embedding vector is to save on computation cost: the larger the vector, the more operations it requires. Each node in the linear layer is connected to the input using its own set of weights. So what are these weights? Well, they are just scalar numbers that the model updates during backpropagation as it gets better and better at the downstream task, which in our case is machine translation. It is also important to point out that these weights are stored in the model as a matrix. That covers the functionality of a single linear layer.
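As a concrete sketch of that shrinking step (the 256 below is just an illustrative target size):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 10, 512)      # [batch, sentence len, embedding size]
proj = nn.Linear(512, 256)       # weight matrix of shape [256, 512], learned via backprop
y = proj(x)                      # [1, 10, 256]: same sentence, smaller vectors
```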
But the Transformer has three separate linear layers here. Why is that? It turns out that each of these layers has a special function: we call them the Query, Key, and Value linear layers.
This can be partially motivated by the way retrieval systems work. When we type a search request on YouTube, let us call that search request a Query. Now let us assume that the YouTube search algorithm is quite simplistic: what it does is go through all the video titles in its database; these titles can be termed the Keys. To find the best matches it has to compute some sort of similarity between our Query and the corresponding Keys. Once the most similar Key has been found, it returns the video affiliated with that Key; we will call the contents of a video its Value. Notice how similarity can be thought of as a proxy for attention, because the model returns the best video by paying attention to the video title most similar to the search query.
Great, but how do we compute the similarity between a Query and a Key? A common way to compute the similarity between two vectors is the cosine similarity, which can be obtained by taking the dot product of the two vectors and then dividing by their magnitudes for scaling purposes. Now, if we want to compute similarities between the rows of two matrices instead of two vectors, we have to transpose the second matrix so the dimensions line up for matrix multiplication.
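A tiny sketch of that idea with random matrices:

```python
import torch

Q = torch.randn(4, 64)            # 4 query vectors of size 64
K = torch.randn(4, 64)            # 4 key vectors of size 64

# transpose K so the inner dimensions match: [4, 64] @ [64, 4] -> [4, 4]
similarity = Q @ K.T              # entry [i, j] = dot product of query i with key j

# cosine similarity additionally divides by the vector magnitudes
cosine = similarity / (Q.norm(dim=1, keepdim=True) * K.norm(dim=1).unsqueeze(0))
```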
How does this tie back to our attention layer and what exactly should we feed to Q,K and V linear layers?
To the Query layer we feed the position-aware embeddings; we then make two more copies of the embeddings and feed them to the Key and Value layers. I know that sounds odd, because in the YouTube example the queries, keys, and values meant different things and had very different contents. So why are we using the same content as input to the Query, Key, and Value layers? Well, that is where the self-attention part comes into play.
We take the three embedding copies and pass each through its linear layer; all that means is that we multiply the embeddings by the weights of that linear layer. Note that each linear layer has its own set of weights. Since matrix multiplication requires compatible dimensions, we may have to transpose our embeddings accordingly. After the multiplication, each linear layer outputs a matrix, and these are called the query, key, and value matrices.
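Sketched out, the three projections might look like this (keeping the output size equal to hid_dim, as the paper does before splitting into heads; the layer names fc_q, fc_k, fc_v are illustrative):

```python
import torch
import torch.nn as nn

hid_dim = 512
fc_q = nn.Linear(hid_dim, hid_dim)   # query linear layer, with its own weight matrix
fc_k = nn.Linear(hid_dim, hid_dim)   # key linear layer
fc_v = nn.Linear(hid_dim, hid_dim)   # value linear layer

x = torch.randn(1, 10, hid_dim)      # position-aware embeddings [batch, seq len, hid dim]
Q, K, V = fc_q(x), fc_k(x), fc_v(x)  # three different matrices from the same input
```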
- First, we take the dot product of the Query matrix and the transpose of the Key matrix; the output of this dot product can be called an attention filter.
- Since this is a very important output, let us understand its contents. At the start of the training process the contents of the attention filter are more or less random numbers, but once training is done they take on more meaningful values. The scores inside this matrix are attention scores.
- We then scale our attention scores. The authors of the paper divide the scores by the square root of the dimension of the key vectors.
- Finally, we squash our attention scores between 0 and 1 using a softmax function, and we get our final attention filter.
- We also have the input that was passed through the value linear layer to generate the value matrix.
So we now have the original value matrix, which pretty much represents the original embedding information, because we did not alter it much except by passing it through a single linear layer. On the other hand, we have the attention filter computed from the dot product of the Q and K matrices.
When we multiply the attention filter with the value matrix we get a filtered value matrix, which assigns high focus to the features that are more important; this filtered value matrix is the output of a single attention head.
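Putting these steps together, here is a minimal sketch of scaled dot-product attention for a single head (masking of <pad> tokens is omitted for brevity):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: [..., seq len, head dim]
    d_k = K.shape[-1]
    # attention filter: dot products of queries with keys, scaled by sqrt(d_k)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # [..., seq len, seq len]
    attention_filter = torch.softmax(scores, dim=-1)     # squash scores into (0, 1)
    # weight the value matrix by the attention filter
    return attention_filter @ V                          # [..., seq len, head dim]
```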
Transformers don't learn just one attention filter; they learn several, each focusing on a different linguistic feature. Each attention head therefore produces its own attention filter, which in turn yields its own filtered value matrix, each zooming in on a different combination of linguistic features. In the paper, the authors used a total of 8 attention heads and we will use the same.
What do we do next?
We simply concatenate them together. Since we don't want this vector to grow longer and longer with each head used, we pass it through a linear layer to shrink its size back to sentence length × embedding size (512). This is the final output of the multi-head attention layer.
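A sketch of that split, attend, concatenate, and project sequence, reusing the scaled_dot_product_attention sketch from above (the head split shown here is one common way to implement it):

```python
import torch
import torch.nn as nn

hid_dim, n_heads = 512, 8
head_dim = hid_dim // n_heads        # 64 dimensions per head
fc_o = nn.Linear(hid_dim, hid_dim)   # final linear layer that shrinks the concatenation back

def multi_head(Q, K, V):
    # Q, K, V: [batch, seq len, hid dim], i.e. the outputs of fc_q, fc_k, fc_v
    batch, seq_len, _ = Q.shape
    # split each matrix into n_heads smaller ones: [batch, n heads, seq len, head dim]
    split = lambda x: x.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)
    # run attention independently per head (reuses the earlier sketch)
    out = scaled_dot_product_attention(split(Q), split(K), split(V))
    # concatenate the heads back together: [batch, seq len, hid dim]
    out = out.transpose(1, 2).contiguous().view(batch, seq_len, hid_dim)
    # project back down so the output size does not grow with the number of heads
    return fc_o(out)
```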
Position-wise Feedforward Layer
Another key component in the Transformer block is the position-wise feed-forward network (FFN), which is relatively simple compared to the multi-head attention layer. It accepts a 3-dimensional input with shape [batch size, sequence length, hid dim].
The position-wise FFN consists of two dense layers. Since the same two dense layers are applied to each position in the sequence independently, it is referred to as position-wise. It is equivalent to applying two 1×1 convolution layers.
The input is transformed from hid_dim to pf_dim, where pf_dim is usually a lot larger than hid_dim. The ReLU activation function and dropout are applied before it is transformed back into a hid_dim representation.
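A minimal sketch of this block, using hid_dim = 512 and pf_dim = 2048 as in the paper:

```python
import torch
import torch.nn as nn

class PositionwiseFeedforward(nn.Module):
    def __init__(self, hid_dim=512, pf_dim=2048, dropout=0.1):
        super().__init__()
        self.fc_1 = nn.Linear(hid_dim, pf_dim)   # expand: hid_dim -> pf_dim
        self.fc_2 = nn.Linear(pf_dim, hid_dim)   # project back: pf_dim -> hid_dim
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: [batch size, sequence length, hid dim]; the same weights are
        # applied independently at every position
        x = self.dropout(torch.relu(self.fc_1(x)))
        return self.fc_2(x)
```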
Decoder
The Transformer decoder is very similar to the encoder block.
Besides the two sub-layers (the multi-head attention layer and the position-wise feed-forward network), the decoder block contains a third sub-layer, which applies multi-head attention over the output of the encoder stack.
Similar to the Transformer encoder block, the Transformer decoder block employs ‘Add and norm’, i.e., the residual connections and the layer normalization to connect each of the sub-layers.
Conclusion
This Transformer architecture achieves a BLEU (Bilingual Evaluation Understudy) score of 36.09, which is pretty good for our Multi30k dataset, used here for translating English sentences into German.