Simplifying Transformers: State of the Art NLP Using Words You Understand — part 3— Attention | by Chen Margalit | Aug, 2023

Deep dive into the core technique of LLMs — attention

Transformers have made a serious impact in the field of AI, perhaps in the entire world. This architecture is comprised of several components, but as the original paper is named “Attention is All You Need”, it’s somewhat evident that the attention mechanism holds particular significance. Part 3 of this series will primarily concentrate on attention and the functionalities around it that make sure the Transformer philharmonic plays well together.

Image from the original paper by Vaswani, A. et al.


In the context of Transformers, attention refers to a mechanism that enables the model to focus on relevant parts of the input during processing. Imagine a flashlight that shines on specific parts of a sentence, allowing the model to give more or less significance to each part depending on context. I believe examples are more effective than definitions because they act as a kind of brain teaser, giving the brain the chance to bridge the gaps and comprehend the concept on its own.

When presented with the sentence, “The man took the chair and disappeared,” you naturally assign varying degrees of importance (i.e., attention) to different parts of the sentence. Somewhat surprisingly, if we remove specific words, the meaning remains mostly intact: “man took chair disappeared.” Although this version is broken English compared to the original sentence, you can still understand the essence of the message. Interestingly, three words (“The,” “the,” and “and”) account for 43% of the words in the sentence but do not contribute significantly to the overall meaning. This observation was probably clear to every Berliner who came across my amazing German while living there (one can either learn German or be happy, it’s a decision you have to make) but it’s much less apparent to ML models.
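The 43% figure is plain arithmetic, and a few lines of Python reproduce it. This is only a sketch: the `low_value_words` set is just the words named above, picked for illustration, not a standard stop-word list.

```python
sentence = "The man took the chair and disappeared"
low_value_words = {"the", "and"}  # illustrative only: the words named in the text

words = sentence.split()
kept = [w for w in words if w.lower() not in low_value_words]
removed_share = (len(words) - len(kept)) / len(words)  # 3 of 7 words

print(" ".join(kept))          # man took chair disappeared
print(f"{removed_share:.0%}")  # 43%
```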

Previous architectures like RNNs (Recurrent Neural Networks) faced a significant challenge: they struggled to “remember” words that appeared far back in the input sequence, typically beyond 20 words. As you already know, these models essentially rely on mathematical operations to process data. Unfortunately, the mathematical operations used in earlier architectures were not efficient enough to carry word representations adequately into the distant future of the sequence.

This limitation in long-term dependency hindered the ability of RNNs to maintain contextual information over extended periods, impacting tasks such as language translation or sentiment analysis where understanding the entire input sequence is crucial. Transformers, with their self-attention mechanism, address this issue more effectively. They can efficiently capture dependencies across long distances in the input, enabling the model to retain context and associations even for words that appear much earlier in the sequence. As a result, Transformers have become a groundbreaking solution to overcome the limitations of previous architectures and have significantly improved the performance of various natural language processing tasks.

To create exceptional products like the advanced chatbots we encounter today, it is essential to equip the model with the ability to distinguish between high and low-value words and also retain contextual information over long distances in the input. The mechanism introduced in the Transformers architecture to address these challenges is known as attention.

*Humans have been developing techniques to discriminate between humans for a very long time, but as inspiring as they are, we won’t be using those here.

Dot Product

How can a model even theoretically discern the importance of different words? When analyzing a sentence, we aim to identify the words that have stronger relationships with one another. As words are represented as vectors (of numbers), we need a measurement for the similarity between vectors. The mathematical operation we use for measuring vector similarity is the “Dot Product.” It involves multiplying the matching elements of two vectors and summing the results to produce a scalar value (e.g., 2, 16, -4.43), which serves as a representation of their similarity. Machine Learning is grounded in various mathematical operations, and among them, the Dot Product holds particular importance. Hence, I’ll take the time to elaborate on this concept.

Imagine we have real representations (embeddings) for 5 words: “florida”, “california”, “texas”, “politics” and “truth”. As embeddings are just numbers, we can potentially plot them on a graph. However, due to their high dimensionality (the number of numbers used to represent the word), which can easily range from 100 to 1000, we can’t really plot them as they are. We can’t plot a 100-dimensional vector on a 2D computer/phone screen. Moreover, the human brain finds it difficult to understand something above 3 dimensions. What does a 4-dimensional vector look like? I don’t know.

To overcome this issue, we utilize Principal Component Analysis (PCA), a technique that reduces the number of dimensions. By applying PCA, we can project the embeddings onto a 2-dimensional space (x,y coordinates). This reduction in dimensions helps us visualize the data on a graph. Although we lose some information due to the reduction, hopefully, these reduced vectors will still preserve enough similarities to the original embeddings, enabling us to gain insights and comprehend relationships among the words.

These numbers are based on the GloVe Embeddings.

florida = [-2.40062016,  0.00478901]
california = [-2.54245794, -0.37579669]
texas = [-2.24764634, -0.12963368]
politics = [3.02004564, 2.88826688]
truth = [4.17067881, -2.38762552]

You can perhaps notice there is some pattern in the numbers, but we’ll plot the numbers to make life easier.

5 2D vectors

In this visualization, we see five 2D vectors (x,y coordinates), representing 5 different words. As you can see, the plot suggests some words are much more related to others.

The mathematical counterpart of visualizing vectors can be expressed through a straightforward equation. If you aren’t particularly fond of mathematics and recall the authors’ description of the Transformers architecture as a “simple network architecture,” you probably think this is what happens to ML people, they get weird. It’s probably true, but not in this case, this is simple. I’ll explain:

Dot Product Formula: a · b = ||a|| ||b|| cos(θ)

The symbol ||a|| denotes the magnitude of vector “a,” which represents the distance between the origin (point 0,0) and the tip of the vector. The calculation for the magnitude is as follows:

Vector magnitude formula: ||a|| = sqrt(a1^2 + a2^2 + ... + an^2)

The outcome of this calculation is a number, such as 4, or 12.4.
Theta (θ), refers to the angle between the vectors (look at the visualization). The cosine of theta, denoted as cos(θ), is simply the result of applying the cosine function to that angle.
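To see that the geometric description (magnitudes times the cosine of the angle) agrees with the element-by-element description, here is a small sketch using the 2D "florida" and "texas" vectors listed earlier. Measuring each vector's direction with `np.arctan2` is my own choice for illustration; it only works this simply because the vectors are 2D.

```python
import numpy as np

florida = np.array([-2.40062016, 0.00478901])
texas = np.array([-2.24764634, -0.12963368])

# Way 1: multiply matching elements and sum the results.
elementwise = float(np.sum(florida * texas))

# Way 2: ||a|| * ||b|| * cos(theta), where theta is the angle between the vectors.
angle_florida = np.arctan2(florida[1], florida[0])
angle_texas = np.arctan2(texas[1], texas[0])
theta = angle_florida - angle_texas
geometric = float(np.linalg.norm(florida) * np.linalg.norm(texas) * np.cos(theta))

print(elementwise)  # approximately 5.395
print(geometric)    # the same number, up to floating-point rounding
```

Both routes give the same scalar, which is why the dot product can be read as "how long are the vectors, and how closely do they point in the same direction".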

Using the GloVe algorithm, researchers from Stanford University have generated embeddings for actual words, as we discussed earlier. Although they have their specific technique for creating these embeddings, the underlying concept remains the same as we talked about in the previous part of the series. As an example, I took 5 words, reduced their dimensionality to 2, and then plotted their vectors as straightforward x and y coordinates.

To make this process function correctly, downloading the GloVe embeddings is a necessary prerequisite.

*Part of the code, especially the first box is inspired by some code I’ve seen, but I can’t seem to find the source.

import pandas as pd
from sklearn.decomposition import PCA

path_to_glove_embds = 'glove.6B.100d.txt'

# Each line of the GloVe file holds a word followed by its 100 numbers.
glove = pd.read_csv(path_to_glove_embds, sep=" ", header=None, index_col=0)
glove_embedding = {key: val.values for key, val in glove.T.items()}

words = ['florida', 'california', 'texas', 'politics', 'truth']
word_embeddings = [glove_embedding[word] for word in words]

print(word_embeddings[0].shape) # 100 numbers to represent each word.

pca = PCA(n_components=2) # reduce dimensionality from 100 to 2.
word_embeddings_pca = pca.fit_transform(word_embeddings)
for i in range(5):
    print(word_embeddings_pca[i])

[-2.40062016 0.00478901] # florida
[-2.54245794 -0.37579669] # california
[-2.24764634 -0.12963368] # texas
[3.02004564 2.88826688] # politics
[ 4.17067881 -2.38762552] # truth

We now possess a genuine representation of all 5 words. Our next step is to conduct the dot product calculations.

Vector magnitude:

import numpy as np

florida_vector = [-2.40062016, 0.00478901]
florida_vector_magnitude = np.linalg.norm(florida_vector)
print(florida_vector_magnitude)

2.4006249368060817 # The magnitude of the vector "florida" is 2.4.

Dot Product between two similar vectors.

import numpy as np

florida_vector = [-2.40062016, 0.00478901]
texas_vector = [-2.24764634, -0.12963368]

print(np.dot(florida_vector, texas_vector)) # approximately 5.395: a large positive number, the vectors are similar.


Dot Product between two dissimilar vectors.

import numpy as np

florida_vector = [-2.40062016, 0.00478901]
truth_vector = [4.17067881, -2.38762552]

print(np.dot(florida_vector, truth_vector)) # approximately -10.024: a negative number, the vectors are dissimilar.


As evident from the dot product calculation, it appears to capture and reflect an understanding of similarities between different concepts.
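For a fuller picture, the sketch below computes the dot product between every pair of the five 2D vectors at once. It also prints the cosine similarity, which is the dot product divided by both magnitudes; that is my addition, included because the raw dot product mixes "how long" with "how aligned", and dividing by the magnitudes isolates the alignment.

```python
import numpy as np

words = ["florida", "california", "texas", "politics", "truth"]
vectors = np.array([
    [-2.40062016, 0.00478901],
    [-2.54245794, -0.37579669],
    [-2.24764634, -0.12963368],
    [3.02004564, 2.88826688],
    [4.17067881, -2.38762552],
])

# Entry [i, j] is the dot product between word i and word j.
dot_products = vectors @ vectors.T

# Cosine similarity: divide each dot product by the two magnitudes involved.
norms = np.linalg.norm(vectors, axis=1)
cosine = dot_products / np.outer(norms, norms)

for i, word in enumerate(words):
    print(f"{word:>10}", np.round(dot_products[i], 2), np.round(cosine[i], 2))
```

The three states score high against each other and negative against "politics" and "truth", matching what the plot suggests.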

Scaled Dot-Product attention

Now that we have a grasp of Dot Product, we can delve back into attention. Particularly, the self-attention mechanism. Using self-attention provides the model with the ability to determine the importance of each word, regardless of its “physical” proximity to the word. This enables the model to make informed decisions based on the contextual relevance of each word, leading to better understanding.

To achieve this ambitious goal, we create 3 matrices built from learnable (!) parameters, known as Query, Key and Value (Q, K, V). The Query matrix can be envisioned as containing the words the user inquires or asks about (e.g. when you ask ChatGPT if “god is available today at 5 p.m.?”, that is the query). The Key matrix encompasses all other words in the sequence. By computing the dot product between these matrices, we get the degree of relatedness between each word and the word we are currently examining (e.g., translating, or producing the answer to the query).

The Value matrix provides the “clean” representation for every word in the sequence. Why do I refer to it as clean when the other two matrices are formed in a similar manner? Because the Value matrix is used as it is: we don’t multiply it by another matrix to produce scores, and we don’t normalize it by some value. This distinction sets the Value matrix apart: it carries the (linearly transformed) word representations that the attention weights are finally applied to, free from additional computations or transformations.

All 3 matrices are constructed with a size of word_embedding (512). However, they are divided into “heads”. In the paper the authors used 8 heads, resulting in each matrix having a size of sequence_length by 64. You might wonder why the same operation is performed 8 times with 1/8 of the data and not once with all the data. The rationale behind this approach is that by conducting the same operation 8 times with 8 different sets of weights (which are as mentioned, learnable), we can exploit the inherent diversity in the data. Each head can focus on a specific aspect within the input and in aggregate, this can lead to better performance.

*In most implementations we don’t really divide the main matrix to 8. The division is achieved through indexing, allowing parallel processing for each part. However, these are just implementation details. Theoretically, we could’ve done pretty much the same using 8 matrices.
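As a rough sketch of the note above, this is one way a single 512-wide matrix can be viewed as 8 heads of 64 without building 8 separate matrices. The sequence length of 10 and the random `q` matrix are placeholders I made up; only the 512 and 8 come from the paper.

```python
import numpy as np

sequence_length, d_model, num_heads = 10, 512, 8
head_dim = d_model // num_heads  # 512 / 8 = 64

rng = np.random.default_rng(0)
q = rng.normal(size=(sequence_length, d_model))  # stand-in for a Query matrix

# (seq, 512) -> (seq, 8, 64) -> (8, seq, 64): one (seq, 64) slice per head.
q_heads = q.reshape(sequence_length, num_heads, head_dim).transpose(1, 0, 2)

print(q_heads.shape)  # (8, 10, 64)
# Head 0 is simply the first 64 columns of the big matrix.
print(np.array_equal(q_heads[0], q[:, :head_dim]))  # True
```

Each head then runs the same attention computation on its own slice, and the results are joined back together afterwards.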

The Q and K are multiplied (dot product) and then normalized by the square root of the number of dimensions. We pass the result through a Softmax function and the result is then multiplied by the matrix V.
The reason for normalizing the results is that Q and K are matrices that are generated somewhat randomly. Their dimensions might be completely unrelated (independent) and multiplications between independent matrices might result in very big numbers which can harm the learning as I’ll explain later in this part.
We then use a non-linear transformation named Softmax, to make all numbers range between 0 and 1, and sum to 1. The result is similar to a probability distribution (as there are numbers from 0 to 1 that add up to 1). These numbers exemplify the relevance of every word to every other word in the sequence.
Finally, we multiply the result by matrix V, and lo and behold, we’ve got the self-attention score.
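Here is a small sketch of what Softmax does to a row of scores, and of why large scores are a concern. The score values are made up. Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(x):
    # Subtracting the max avoids overflow in exp() and leaves the output unchanged.
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])  # made-up raw scores
large_scores = scores * 10          # the same scores, ten times larger

probs = softmax(scores)
large_probs = softmax(large_scores)

print(np.round(probs, 3))        # roughly [0.659 0.242 0.099], sums to 1
print(np.round(large_probs, 3))  # almost all of the weight lands on the first entry
```

With the larger inputs, Softmax becomes nearly "winner takes all", which leaves very little signal for the other words. Dividing by the square root of the number of dimensions keeps the scores in a friendlier range.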

*The encoder is actually built out of N (in the paper, N=6) identical layers; each layer gets its input from the previous layer and does the same. The output of the final layer is passed to the Decoder (which we will talk about in a later part of this series).

Here is a visualization of self-attention. It’s like groups of friends in a classroom. Some people are more connected to some people. Some people aren’t very well connected to anyone.

Image from the original paper by Vaswani, A. et al.

The Q, K and V matrices are derived through a linear transformation of the embedding matrix. Linear transformations are important in machine learning, and if you have an interest in becoming an ML practitioner, I recommend exploring them further. I won’t delve deep, but I will say that linear transformation is a mathematical operation that moves a vector (or a matrix) from one space to another space. It sounds more complex than it is. Imagine an arrow pointing in one direction, and then moving to point 30 degrees to the right. This illustrates a linear transformation. There are a few conditions for such an operation to be considered linear but it’s not really important for now. The key takeaway is that it retains many of the original vector properties.
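The arrow example can be written as a matrix multiplication, which is all a linear transformation is in practice. The sketch below rotates an arrow 30 degrees to the right, then applies the same idea to a toy embedding matrix. The sizes (4 words, 8 numbers each) and the random `w_q` weights are hypothetical stand-ins for the learned weights.

```python
import numpy as np

# Rotating an arrow 30 degrees clockwise ("to the right").
angle = np.deg2rad(-30)  # negative angle = clockwise
rotation = np.array([
    [np.cos(angle), -np.sin(angle)],
    [np.sin(angle),  np.cos(angle)],
])

arrow = np.array([0.0, 1.0])  # pointing straight up
rotated = rotation @ arrow

print(np.round(rotated, 3))     # approximately [0.5, 0.866]
print(np.linalg.norm(rotated))  # the length is still 1 (up to rounding)

# The same operation, used to derive a Query matrix from embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))  # 4 words, toy embedding size of 8
w_q = rng.normal(size=(8, 8))         # hypothetical learnable weights
q = embeddings @ w_q                  # one Query row per word

print(q.shape)  # (4, 8)
```

The rotation changes the arrow's direction but keeps its length, which is one example of a property the transformation retains.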

The entire calculation of the self-attention layers is performed by applying the following formula:

Scaled Dot-Product Attention: Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V. Image from the original paper by Vaswani, A. et al.

The calculation process unfolds as follows:
1. We multiply Q by K transposed (flipped).
2. We divide the result by the square root of the dimensionality of matrix K.
3. We now have the “attention matrix scores” that describe how similar every word is to every other word. We pass every row to a Softmax (a non-linear) transformation. Softmax does three interesting relevant things:
a. It scales all the numbers so they are between 0 and 1.
b. It makes all the numbers sum to 1.
c. It accentuates the gaps, making the slightly more important, much more important. As a result, we can now easily distinguish the varying degrees to which the model perceives the connection between words x1 and x2, x3, x4, and so on.
4. We multiply the score by the V matrix. This is the final result of the self-attention operation.
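The four steps above can be sketched in NumPy for a single head. This is a minimal illustration rather than a full implementation: the sizes are toy values, and the random `w_q`, `w_k` and `w_v` matrices are placeholders for weights that would normally be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    exps = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exps / exps.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    d_k = k.shape[-1]
    scores = q @ k.T                    # step 1: Q times K transposed
    scores = scores / np.sqrt(d_k)      # step 2: scale by sqrt of K's dimensionality
    weights = softmax(scores, axis=-1)  # step 3: Softmax over every row
    return weights @ v, weights         # step 4: multiply by V

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8        # toy sizes, chosen for illustration
x = rng.normal(size=(seq_len, d_model)) # stand-in for the embedded sentence
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)

print(output.shape)          # (5, 8): one output vector per word
print(weights.shape)         # (5, 5): every word scored against every word
print(weights.sum(axis=-1))  # every row sums to 1
```

Row i of `weights` says how much word i draws from each other word, and row i of `output` is the matching weighted mix of the Value rows.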


In the previous chapter in this series, I’ve explained that we employ dummy tokens to treat special occurrences in the sentence such as the first word in the sentence, the last word, etc. One of these tokens, denoted as <PADDING>, indicates that there is no actual data, and yet we need to maintain consistent matrix sizes throughout the entire process. To ensure the model comprehends these are dummy tokens and should therefore not be considered during the self-attention calculation, we represent these tokens with minus infinity (in practice, a very large negative number, e.g. -153513871339). The masking values are added to the result of the multiplication between Q and K. Softmax then turns these numbers into 0. This allows us to effectively ignore the dummy tokens during the attention mechanism while preserving the integrity of the calculations.
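A small sketch of that masking step, with made-up scores for a 4-token sequence whose last token is padding. The value `-1e9` is an arbitrary "very large negative number" I picked for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    exps = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exps / exps.sum(axis=axis, keepdims=True)

# Made-up Q·K^T scores: row i holds how much token i relates to each token.
scores = np.array([
    [2.0, 1.0, 0.5, 0.3],
    [0.4, 1.5, 0.2, 0.9],
    [0.1, 0.7, 1.2, 0.6],
    [0.3, 0.8, 0.4, 1.1],
])
is_padding = np.array([False, False, False, True])  # the last token is <PADDING>

# Added to every row, so the padding *column* becomes hugely negative.
mask = np.where(is_padding, -1e9, 0.0)
weights = softmax(scores + mask, axis=-1)

print(np.round(weights, 3))  # the last column is all zeros
```

No token pays any attention to the padding position, and each row still sums to 1 across the real tokens.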


Following the self-attention layer, a dropout operation is applied. Dropout is a regularization technique widely used in Machine Learning. The purpose of regularization is to impose constraints on the model during training, making it more difficult for the model to rely heavily on specific input details. As a result, the model learns more robustly and improves its ability to generalize. The actual implementation involves choosing some of the activations (the numbers coming out of different layers) randomly, and zeroing them out. In every pass of the same layer, different activations will be zeroed out preventing the model from finding solutions that are specific to the data it gets. In essence, dropout helps in enhancing the model’s ability to handle diverse inputs and making it more difficult for the model to be tailored to specific patterns in the data.
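Dropout itself is only a few lines. The sketch below uses the common "inverted" variant, which also scales the surviving activations so their average stays the same; that scaling is a detail of typical implementations, not something described above. The `dropout` function and its parameters are hypothetical names for this example.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    # At inference time (training=False) dropout does nothing.
    if not training or drop_prob == 0.0:
        return activations
    # Randomly choose which activations survive this pass.
    keep_mask = rng.random(activations.shape) >= drop_prob
    # Zero out the rest and rescale the survivors ("inverted" dropout).
    return activations * keep_mask / (1.0 - drop_prob)

rng = np.random.default_rng(0)
activations = np.ones(1000)

dropped = dropout(activations, drop_prob=0.1, rng=rng)
print((dropped == 0).mean())  # close to 0.1: about 10% were zeroed out
print(dropped.mean())         # close to 1.0 thanks to the rescaling

untouched = dropout(activations, drop_prob=0.1, rng=rng, training=False)
print(np.array_equal(untouched, activations))  # True
```

Calling `dropout` again draws a new random mask, so a different set of activations is zeroed on every pass.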

Skip connection

Another important operation done in the Transformer architecture is called Skip Connection.

Image from the original paper by Vaswani, A. et al.

Skip Connection is a way to pass input without subjecting it to any transformation. To illustrate, imagine that I report to my manager who reports it to his manager. Even with very pure intentions of making the report more useful, the input now goes through some modifications when processed by another human (or ML layer). In this analogy, the Skip-Connection would be me, reporting straight to my manager’s manager. Consequently, the upper manager receives input both through my manager (processed data) and straight from me (unprocessed). The senior manager can then hopefully make a better decision. The rationale behind employing skip connections is to address potential issues such as vanishing gradients which I will explain in the following section.

Add & Norm Layer

The “Add & Norm” layer performs addition and normalization. I’ll start with addition as it’s simpler. Basically, we add the output from the self-attention layer to the original input (received from the skip connection). This addition is done element-wise (every number to its same positioned number). The result is then normalized.
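The addition half is easy to show in code. In this sketch, `sublayer` is a made-up stand-in for the self-attention layer, only there so that the "processed" and "unprocessed" versions differ.

```python
import numpy as np

def sublayer(x):
    # Hypothetical stand-in for any transformation, e.g. self-attention.
    return 0.1 * x ** 2

x = np.array([1.0, -2.0, 3.0])  # the original input (the skip connection)
processed = sublayer(x)         # the input after going through the layer
added = x + processed           # element-wise: every number to its same-positioned number

print(processed)  # approximately [0.1, 0.4, 0.9]
print(added)      # approximately [1.1, -1.6, 3.9]
```

The output keeps the same shape as the input, which is what makes it possible to add the two together.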

The reason we normalize, again, is that each layer performs numerous calculations. Multiplying numbers many times can lead to unintended scenarios. For instance, if I take a fraction, like 0.3, and I multiply it with another fraction, like 0.9, I get 0.27 which is smaller than where it started. If I do this many times, I might end up with something very close to 0. This could lead to a problem in deep learning called vanishing gradients.
I won’t go too deep right now so this article doesn’t take ages to read, but the idea is that if numbers become very close to 0, the model won’t be able to learn. The basis of modern ML is calculating gradients and adjusting the weights using those gradients (and a few other ingredients). If those gradients are close to 0, it will be very difficult for the model to learn effectively.

Conversely, the opposite phenomenon, called exploding gradients, can occur when numbers larger than 1 get multiplied by other numbers larger than 1 again and again, causing the values to become excessively large. As a result, the model faces difficulties in learning due to the enormous changes in weights and activations, which can lead to instability and divergence during the training process.
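Both effects are easy to reproduce with plain multiplication. This is only an analogy for what happens to gradients across many layers; the 0.9, 1.1 and 100 repetitions are arbitrary numbers chosen for the example.

```python
small = 1.0
large = 1.0

for _ in range(100):
    small *= 0.9  # repeatedly multiplying by a fraction
    large *= 1.1  # repeatedly multiplying by a number above 1

print(small)  # about 2.7e-05: practically vanished
print(large)  # about 1.4e+04: exploding
```

After just 100 multiplications, two starting points that were identical end up roughly nine orders of magnitude apart.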

ML models are somewhat like a small child, they need protection. One of the ways to protect these models from numbers getting too big or too small is normalization.

The layer normalization operation looks frightening (as always) but it’s actually relatively simple.

Image by PyTorch

In the layer normalization operation, we follow these simple steps for each input:

  1. Subtract its mean from the input.
  2. Divide by the square root of the variance plus an epsilon (some tiny number), used to avoid division by zero.
  3. Multiply the resulting score by a learnable parameter called gamma (γ).
  4. Add another learnable parameter called beta (β).

These steps ensure the mean will be close to 0 and the standard deviation close to 1. The normalization process enhances the training’s stability, speed, and overall performance.


# x being the input.

(x - mean(x)) / sqrt(variance(x) + epsilon) * gamma + beta
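The one-liner above translates almost directly into NumPy. In this sketch the input is random data with a deliberately off-center mean and spread; gamma starts at 1 and beta at 0, which is a common initialization, and the `1e-5` epsilon is likewise a typical default rather than a required value.

```python
import numpy as np

def layer_norm(x, gamma, beta, epsilon=1e-5):
    # Statistics are computed per token, across its features (the last axis).
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)
    normalized = (x - mean) / np.sqrt(variance + epsilon)
    return normalized * gamma + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(4, 512))  # 4 tokens, 512 features each
gamma = np.ones(512)   # learnable scale, shown at a common starting value
beta = np.zeros(512)   # learnable shift, shown at a common starting value

out = layer_norm(x, gamma, beta)

print(np.round(out.mean(axis=-1), 4))  # close to 0 for every token
print(np.round(out.std(axis=-1), 4))   # close to 1 for every token
```

An "Add & Norm" step would then be `layer_norm(x + sublayer_output, gamma, beta)`, where `sublayer_output` is whatever the self-attention layer produced.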


At this point, we have a solid understanding of the main inner workings of the Encoder. Additionally, we have explored Skip Connections, a purely technical (and important) technique in ML that improves the model’s ability to learn.

Although this section is a bit complicated, you have already acquired a substantial understanding of the Transformers architecture as a whole. As we progress further in the series, this understanding will serve you in understanding the remaining parts.
Remember, this is the State of the Art in a complicated field. This isn’t easy stuff. Even if you still don’t understand everything 100%, well done for making this great progress!

The next part will be about a foundational (and simpler) concept in Machine Learning, the Feed Forward Neural Network.

Image from the original paper by Vaswani, A. et al.
