Learning Transformers Code First Part 2 — GPT Up Close and Personal
by Lily Hughes-Robinson, Jul 2023


With that said, let’s dive in. To understand GPT models in detail, we must start with the transformer. The transformer employs a self-attention mechanism known as scaled dot-product attention. The following explanation is derived from this insightful article on scaled dot-product attention, which I recommend for a more in-depth understanding. Essentially, for every element of an input sequence (the i-th element), we want to produce a weighted average of all the elements in the sequence. These weights are calculated by taking the dot product of the vector at the i-th element with every vector in the sequence and then applying a softmax, so the weights are values between 0 and 1. In the original “Attention Is All You Need” paper, these inputs are named query (the vector at the i-th element), key (the entire sequence), and value (also the entire sequence). The learnable weights that produce these inputs are initialized to random values and refined as more passes occur through the neural network.
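To make the weighting step concrete, here is a minimal, self-contained sketch of scaled dot-product attention for a single, unbatched sequence. The tensor sizes and variable names are purely illustrative and not taken from nanoGPT; there are no learned projections here, just the weighting itself:

import torch
import torch.nn.functional as F

# A toy sequence of 5 tokens, each represented by a 16-dimensional vector.
x = torch.randn(5, 16)

# Use the sequence itself as queries, keys, and values to isolate the weighting step.
q, k, v = x, x, x

# Dot product of every query with every key, scaled by sqrt of the key dimension.
scores = (q @ k.transpose(0, 1)) / (k.size(-1) ** 0.5)   # (5, 5)

# Softmax turns each row of scores into weights between 0 and 1 that sum to 1.
weights = F.softmax(scores, dim=-1)                       # (5, 5)

# Each output row is a weighted average of all value vectors.
out = weights @ v                                          # (5, 16)

Each row of weights sums to 1, so every output position is a blend of all value vectors, dominated by the positions whose keys most resemble its query.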

nanoGPT implements scaled dot-product attention and extends it to multi-head attention, meaning multiple attention operations occur at once. It also implements it as a torch.nn.Module, which allows it to be composed with other network layers.

import math

import torch
import torch.nn as nn
from torch.nn import functional as F

class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        # flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            # causal mask to ensure that attention is only applied to the left in the input sequence
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                        .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels
            y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
        else:
            # manual implementation of attention
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y

Let’s dissect this code further, starting with the constructor. First, we verify that the number of attention heads (n_head) divides the embedding dimensionality (n_embd) evenly. This is crucial because when the embedding is split into sections for each head, we want to cover the whole embedding space without any gaps. Next, we initialize two linear layers, c_attn and c_proj: c_attn holds the working space for the matrices that make up a scaled dot-product attention calculation, while c_proj stores the final result of the calculation. The embedding dimension is tripled in c_attn because we need space for the three major components of attention: query, key, and value.
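As a quick illustration of that tripled projection, here is a hypothetical snippet (the sizes are made up for readability) showing how a single linear layer produces the query, key, and value tensors in one pass:

import torch
import torch.nn as nn

# Toy sizes chosen only for illustration.
B, T, n_embd = 1, 4, 8

c_attn = nn.Linear(n_embd, 3 * n_embd)   # one projection that produces q, k and v together
x = torch.randn(B, T, n_embd)

qkv = c_attn(x)                          # (B, T, 3 * n_embd) == (1, 4, 24)
q, k, v = qkv.split(n_embd, dim=2)       # three tensors, each (B, T, n_embd)
print(q.shape, k.shape, v.shape)         # torch.Size([1, 4, 8]) three times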

We also have two dropout layers, attn_dropout and resid_dropout. A dropout layer randomly zeroes elements of its input tensor with a given probability, which, according to the PyTorch docs, helps reduce overfitting. The value in config.dropout is the probability that a given element will be dropped during a pass through a dropout layer.
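Here is a tiny, hypothetical demonstration of what a dropout layer does to its input during training; the probability and tensor are arbitrary:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.2)   # each element is zeroed with probability 0.2
drop.train()               # dropout is only active in training mode

x = torch.ones(2, 5)
print(drop(x))   # roughly 20% of entries become 0; survivors are scaled by 1 / (1 - 0.2)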

We finalize the constructor by checking whether the user has PyTorch 2.0 available, which ships an optimized version of scaled dot-product attention. If it is available, the class uses it; otherwise, we set up a bias mask. This mask is a component of the optional masking feature of the attention mechanism. The torch.tril method returns a matrix with its upper triangular section set to zero. Combined with torch.ones, it produces a mask of 1s and 0s that the attention mechanism uses to ensure each position can only attend to itself and earlier positions, so the model cannot peek at future tokens when predicting the next one.
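For example, this is the mask torch.tril produces for a toy block size of 4 (output shown as comments):

import torch

block_size = 4
mask = torch.tril(torch.ones(block_size, block_size))
print(mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
# Row i has 1s only up to column i, so position i can attend to itself and
# earlier positions, never to later ones.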

Next, we delve into the forward method of the class, where the attention algorithm is applied. Initially, we read off the sizes of our input tensor along its three dimensions: batch size (B), time or sequence length (T), and channels or embedding dimensionality (C). nanoGPT employs a batched learning process, which we will explore in greater detail when examining the transformer model that utilizes this attention layer. For now, it’s sufficient to understand that we are dealing with the data in batches. We then feed the input x into the linear layer c_attn, which expands the dimensionality from n_embd to three times n_embd. The output of that transformation is split into our q, k, and v variables, which are the inputs to the attention algorithm. Subsequently, the view method is utilized to reorganize the data in each of these variables into the format expected by PyTorch’s scaled_dot_product_attention function.
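To visualize that reshaping, here is a hypothetical example with the same toy sizes as before, showing how view and transpose move the head dimension next to the batch dimension:

import torch

# B=1, T=4, C=8 with n_head=2, so each head works with hs = C // n_head = 4 dimensions.
B, T, C, n_head = 1, 4, 8, 2
k = torch.randn(B, T, C)

k = k.view(B, T, n_head, C // n_head)   # (B, T, nh, hs): split the channels per head
k = k.transpose(1, 2)                   # (B, nh, T, hs): heads become a batch-like dim
print(k.shape)                          # torch.Size([1, 2, 4, 4])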

When the optimized function isn’t available, the code falls back to a manual implementation of scaled dot-product attention. It begins by taking the dot product of the q and k matrices, with k transposed so the shapes line up, and scales the result by the square root of the size of the last dimension of k. We then mask the scaled output using the previously created bias buffer, replacing the 0s with negative infinity. Next, a softmax is applied to the att matrix, converting the negative infinities to 0s and scaling all other values so that each row lies between 0 and 1 and sums to 1. We then apply a dropout layer to help avoid overfitting before taking the dot product of the att matrix and v.
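The effect of the mask plus softmax is easiest to see on a single made-up row of scores:

import torch
import torch.nn.functional as F

# Hypothetical attention scores for one position attending over 4 positions.
scores = torch.tensor([0.5, 1.0, 2.0, -0.3])
mask   = torch.tensor([1., 1., 0., 0.])   # only the first two positions are visible

scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)
print(weights)   # approximately tensor([0.3775, 0.6225, 0.0000, 0.0000]): masked entries become 0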

Regardless of which scaled dot-product implementation is used, the multi-head output is reassembled side by side, passed through the output projection and a final dropout layer, and then returned. This is the complete implementation of the attention layer in fewer than 50 lines of Python/PyTorch. If you don’t fully understand the above code, I recommend spending some time reviewing it before proceeding with the rest of the article.
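If you want to experiment with the layer yourself, the snippet below shows one way to exercise it. It assumes the CausalSelfAttention class above is defined in the same script and uses a minimal, hypothetical config object that supplies only the fields the class reads; nanoGPT’s real config has more options:

import torch
from dataclasses import dataclass

# A minimal stand-in for nanoGPT's config, with only the fields this class reads.
@dataclass
class ToyConfig:
    n_embd: int = 64
    n_head: int = 4
    dropout: float = 0.0
    bias: bool = True
    block_size: int = 32

attn = CausalSelfAttention(ToyConfig())
x = torch.randn(2, 10, 64)   # (batch, sequence length, embedding dim)
y = attn(x)
print(y.shape)               # torch.Size([2, 10, 64]): same shape as the input

The output has the same shape as the input, which is what allows this module to be stacked inside a larger transformer block.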


