With that said, let’s dive in. To understand GPT models in detail, we must start with the transformer. The transformer employs a self-attention mechanism known as scaled dot-product attention. The following explanation is derived from this insightful article on scaled dot-product attention, which I recommend for a more in-depth understanding. Essentially, for every element of an input sequence (the i-th element), we replace it with a weighted average of all the elements in the sequence. These weights are calculated by taking the dot product of the vector at the i-th element with every other vector in the sequence and then applying a softmax, so the weights are values between 0 and 1 that sum to 1. In the original “Attention Is All You Need” paper, these inputs are named the query (the vector at the i-th element), the keys (the vectors it is compared against), and the values (the vectors being averaged). The projection weights feeding the attention mechanism are initialized to random values and learned as more training passes occur within the neural network.
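To make that concrete, here is a minimal sketch of scaled dot-product attention for a single sequence (the function name and toy sizes are mine, not from the paper or nanoGPT):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Toy single-head attention: q, k, v are (T, d) tensors."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # (T, T) pairwise dot products, scaled
    weights = F.softmax(scores, dim=-1)                       # each row sums to 1
    return weights @ v                                        # weighted average of the value vectors

torch.manual_seed(0)
x = torch.randn(4, 8)                         # 4 tokens, 8-dim embeddings
out = scaled_dot_product_attention(x, x, x)   # self-attention: q = k = v = x
print(out.shape)                              # torch.Size([4, 8])
```

Row i of `weights` holds the softmaxed dot products of token i against every token, which is exactly the weighted-average description above.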

nanoGPT implements scaled dot-product attention and extends it to multi-head attention, meaning multiple attention operations occurring at once. It implements this as a `torch.nn.Module`, which allows it to be composed with other network layers:

```python
import math

import torch
import torch.nn as nn
from torch.nn import functional as F

class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        # flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            # causal mask to ensure that attention is only applied to the left in the input sequence
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                         .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()  # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels
            y = torch.nn.functional.scaled_dot_product_attention(
                q, k, v, attn_mask=None,
                dropout_p=self.dropout if self.training else 0,
                is_causal=True)
        else:
            # manual implementation of attention
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble all head outputs side by side

        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y
```

Let’s dissect this code further, starting with the constructor. First, we verify that the number of attention heads (`n_head`) divides the dimensionality of the embedding (`n_embd`) evenly. This is crucial because when the embedding is divided into sections for each head, we want to cover all of the embedding space without any gaps. Next, we initialize two linear layers, `c_attn` and `c_proj`: `c_attn` holds all the working space for the matrices that make up a scaled dot-product attention calculation, while `c_proj` stores the final result of the calculation. The embedding dimension is tripled in `c_attn` because we need space for the three major components of attention: the query, key, and value.
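A quick sketch of that tripling trick in isolation (the sizes here are made up for illustration):

```python
import torch
import torch.nn as nn

n_embd = 8
c_attn = nn.Linear(n_embd, 3 * n_embd)    # one projection produces q, k, v together
x = torch.randn(2, 5, n_embd)             # (batch, tokens, embedding)
q, k, v = c_attn(x).split(n_embd, dim=2)  # carve the tripled output into three n_embd chunks
print(q.shape, k.shape, v.shape)          # each is torch.Size([2, 5, 8])
```

Fusing the three projections into one `Linear` is a common efficiency trick: one matrix multiply instead of three.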

We also have two dropout layers, `attn_dropout` and `resid_dropout`. A dropout layer randomly zeroes elements of its input matrix with a given probability. According to the PyTorch docs, this serves the purpose of reducing overfitting in the model. The value in `config.dropout` is the probability that a given element will be dropped during a pass through a dropout layer.
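You can see the behavior directly with PyTorch's `nn.Dropout` (note that during training PyTorch also rescales the surviving elements by 1/(1-p) so the expected sum is unchanged):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)  # each element is zeroed with probability 0.5
x = torch.ones(4, 4)
y = drop(x)               # training mode: zeros some entries, scales survivors to 2.0
print(y)

drop.eval()               # eval mode: dropout is a no-op
print(torch.equal(drop(x), x))  # True
```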

We finalize the constructor by checking whether the user has PyTorch 2.0 available, which boasts an optimized version of scaled dot-product attention. If available, the class utilizes it; otherwise we set up a bias mask. This mask is a component of the optional masking feature of the attention mechanism. The `torch.tril` method yields a matrix with its upper triangular section converted to zeros; combined with `torch.ones`, it generates a triangular mask of 1s and 0s. The attention mechanism uses this mask so that each position can only attend to positions at or before it, which is what makes the self-attention causal.
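Here is what that mask looks like for a toy block size of 4, and how `masked_fill` applies it:

```python
import torch

block_size = 4
mask = torch.tril(torch.ones(block_size, block_size))  # 1s on and below the diagonal
print(mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])

scores = torch.randn(block_size, block_size)
masked = scores.masked_fill(mask == 0, float('-inf'))  # future positions become -inf
```

Row i of the mask allows attention only to positions 0..i, so token i never "sees" tokens that come after it.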

Next, we delve into the `forward` method of the class, where the attention algorithm is applied. Initially, we determine the sizes of our input matrix and split them into three dimensions: **B**atch size, **T**ime (the number of tokens in the sequence), and **C**hannels (the embedding size). nanoGPT employs a batched learning process, which we will explore in greater detail when examining the transformer model that utilizes this attention layer; for now, it’s sufficient to understand that we are dealing with the data in batches. We then feed the input `x` into the linear transformation layer `c_attn`, which expands the dimensionality from `n_embd` to three times `n_embd`. The output of that transformation is split into our `q`, `k`, and `v` variables, which are the inputs to the attention algorithm. Subsequently, the `view` method is used to reorganize the data in each of these variables into the format expected by the PyTorch `scaled_dot_product_attention` function.
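The reshaping step is worth seeing on its own; with toy sizes (mine, not nanoGPT's defaults), each head ends up with its own slice of the embedding and the head dimension sits next to the batch dimension:

```python
import torch

B, T, n_head, n_embd = 2, 5, 4, 8
hs = n_embd // n_head  # head size: each head gets a slice of the embedding
k = torch.randn(B, T, n_embd)
k = k.view(B, T, n_head, hs).transpose(1, 2)  # (B, nh, T, hs): heads act like extra batches
print(k.shape)  # torch.Size([2, 4, 5, 2])
```

Because the head dimension behaves like a batch dimension, all four heads are computed by one batched matrix multiply rather than a Python loop.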

When the optimized function isn’t available, the code falls back to a manual implementation of scaled dot-product attention. It begins by taking the dot product of the `q` and `k` matrices, with `k` transposed so the shapes line up for matrix multiplication, and the result is scaled by the square root of the size of `k`’s last dimension (the head size). We then mask the scaled output using the previously created bias buffer, replacing the positions where the mask is 0 with negative infinity. Next, a softmax function is applied to the `att` matrix, converting the negative infinities to 0s and ensuring each row’s remaining values lie between 0 and 1 and sum to 1. We then apply a dropout layer to reduce overfitting before taking the dot product of the `att` matrix and `v`.
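The mask-then-softmax trick can be checked in isolation with hand-picked scores (these numbers are illustrative):

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[1.0, float('-inf'), float('-inf')],
                       [0.5, 2.0,           float('-inf')],
                       [0.3, 0.1,           1.5]])
att = F.softmax(scores, dim=-1)  # -inf entries become exactly 0 after softmax
print(att)
```

The first token can only attend to itself (its row is `[1, 0, 0]`), the second to the first two tokens, and so on, which is precisely the causal pattern the bias buffer enforces.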

Regardless of which scaled dot-product implementation is used, the multi-head output is reassembled side by side before being passed through the output projection `c_proj` and a final dropout layer, and the result is returned. This is the complete implementation of the attention layer in under 50 lines of Python/PyTorch. If you don’t fully comprehend the above code, I recommend spending some time reviewing it before proceeding with the rest of the article.