Padding Large Language Models — Examples with Llama 2 | by Benjamin Marie | Aug, 2023


Best practices to pad training examples for causal LLMs

Image by the author — Based on an image from Pixabay

Padding is one of the most under-documented aspects of large language models (LLMs). Why? Simply because LLMs are usually pre-trained without padding.

Nonetheless, padding is necessary when fine-tuning LLMs on custom datasets. Failing to pad training examples correctly can cause various kinds of unexpected behavior: a null or infinite loss during training, over-generation, or empty output during inference are all symptoms of incorrect padding.

In this article, I first explain what padding is and why it is necessary. Then, I show how to find the correct padding strategy for an LLM pre-trained without padding, and I propose two different solutions for adding padding support to LLMs using Hugging Face’s Transformers.

Toward the end of the article, I also provide examples showing how to pad your training examples for Llama 2.

After reading this article, you should be able to figure out by yourself how to pad training examples for LLMs without reading their documentation or tutorials.

What is padding and why do we pad?

Let’s take one example that we wish to use for fine-tuning an LLM.

example = "You are not a chatbot."

We have to turn this example into a sequence of tokens. Libraries such as Transformers usually tokenize in the following steps:

  • Segment the example into subwords according to a given vocabulary:
example = ["▁You", "▁are", "▁not", "▁a". "▁chat", "bot", "."]
  • Replace words by their index from the vocabulary to obtain a sequence of integers:
example = [887, 526, 451, 263, 13563, 7451, 29889]
  • Add special tokens to the sequence: BOS token, EOS token, UNK token, PAD token, etc.
example = [1, 887, 526, 451, 263, 13563, 7451, 29889]

Note: For this example, I use Llama 2’s tokenizer. We will see below in detail how to do it.
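
Here is a minimal sketch of these three steps with Hugging Face’s Transformers. It assumes you have access to the gated meta-llama/Llama-2-7b-hf checkpoint on the Hub; any Llama 2 checkpoint sharing the same tokenizer would work.

from transformers import AutoTokenizer

# Assumption: the gated "meta-llama/Llama-2-7b-hf" checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

example = "You are not a chatbot."

# Step 1: segmentation into subwords
print(tokenizer.tokenize(example))
# ['▁You', '▁are', '▁not', '▁a', '▁chat', 'bot', '.']

# Steps 2 and 3: subwords mapped to their vocabulary indices, with the special
# tokens added by default (Llama 2's tokenizer prepends the BOS token, id 1)
print(tokenizer(example)["input_ids"])
# [1, 887, 526, 451, 263, 13563, 7451, 29889]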


