Introduction to the Open LLM Falcon-40B: Performance, Training Data, and Architecture


Get started using Falcon-7B, Falcon-40B, and their instruct versions

The head of a falcon. Photo by Brandon on Unsplash.

The Falcon models have drawn a lot of attention since their release in May 2023.

They are causal large language models (LLMs), so-called “decoder-only” models, very much like GPT.

Definition: Causal Language Model

Causal language modeling involves predicting the token that follows a sequence of tokens. During training, the model’s attention is solely directed toward the left context. The right context is masked. These models are usually trained on billions of words.
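
To make the “left context only” idea concrete, here is a minimal PyTorch sketch (not Falcon’s actual code) of the causal attention mask such models rely on:

import torch

seq_len = 5
# Lower-triangular mask: position i may attend to positions 0..i (its left context),
# while future positions (the right context) remain masked out.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(causal_mask)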

The Falcon models have been completely free, even for commercial use (Apache 2.0 license), since May 31st. They were developed and trained by the Technology Innovation Institute (TII) of Abu Dhabi.

According to the first results, Falcon-40B, the biggest of the Falcon models, outperforms the other open causal LLMs evaluated so far, including LLaMa-65B and MPT-7B.

In this blog post, I introduce in detail Falcon-40B, Falcon-7B, and their instruct versions. We will see how they perform compared to other models, how they were trained, and how to run Falcon-7B on your own GPU with QLoRa.

The instruct version of Falcon-40B is ranked first on the OpenLLM leaderboard. The standard version is ranked second.

The OpenLLM leaderboard evaluates the performance of LLMs on 4 tasks:

  • AI2 Reasoning Challenge (25-shot): Questions of grade-school science.
  • HellaSwag (10-shot): A commonsense inference benchmark.
  • MMLU (5-shot): 57 tasks in various domains such as maths, computer science, and law.
  • TruthfulQA (0-shot): A benchmark that evaluates how truthful the model is when answering questions.

Falcon-40B outperforms Meta AI’s LLaMa-65B on all these tasks.

The Falcon models were mainly trained on the Falcon RefinedWeb dataset. It was also created by TII and is distributed under an Apache 2.0 license.

RefinedWeb was extracted from CommonCrawl and has been thoroughly curated. TII claims it is multimodal-friendly since they preserved links and alt texts of images.

In the dataset card published in the Hugging Face Hub, TII wrote: “This public extract […]”. To me, it is thus unclear whether the Falcon models have been trained on this public version of the dataset, which is only an “extract”, or whether they have used a bigger internal version.

This extract requires 2.8 TB of hard drive space once unpacked.

Since it is available in the Hugging Face Hub, you only need to run the following lines to start using it:

from datasets import load_dataset
rw = load_dataset("tiiuae/falcon-refinedweb")

Note: You need the “datasets” library. If you don’t have it, you can install it with “pip install datasets”.
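
Be aware that this downloads the whole dataset. If you just want to peek at it without filling 2.8 TB of disk, you can use streaming, a generic option of load_dataset (not something specific to RefinedWeb):

from datasets import load_dataset

# Stream RefinedWeb instead of downloading the full dataset to disk
rw_stream = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
# Print the first example of the stream
print(next(iter(rw_stream)))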

RefinedWeb was combined with curated corpora to train the Falcon models.

This dataset represents 75% of the pre-training data of the Falcon models. It covers only English. To add more languages, they also prepared “RefinedWeb-Europe”, which covers several European languages: German, Spanish, French, Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish.

Finally, to cover more genres and domains, they added corpora of books, conversations (e.g., from Reddit), code, technical reports, and scientific papers (e.g., from arXiv). Note: They didn’t disclose the source of the “code” data. It is also unclear under which licenses the datasets they compiled are distributed.

In total, that’s 1,500 billion tokens used to pre-train the Falcon models.

The Falcon-40B has the following architecture:

  • Layers: 60
  • Embedding dimensions: 8,192
  • Heads: 64
  • Vocabulary size: 65,024
  • Sequence length: 2,048

This is very similar to the architecture of LLaMa, except that the vocabulary is twice as big.
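
If you want to check these hyperparameters yourself, you can load only the model configuration from the Hugging Face Hub, without downloading the weights:

from transformers import AutoConfig

# Download only the configuration file of Falcon-40B (no model weights)
config = AutoConfig.from_pretrained("tiiuae/falcon-40b", trust_remote_code=True)
print(config)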

In my opinion, the sequence length is quite short at a time when we see LLMs accepting sequences of more than 10,000 tokens, such as GPT-4 and Claude.

The Falcon-7B has a smaller architecture that enables its fine-tuning on consumer hardware. The main differences with the 40B version are that the number of layers and the embedding dimensions are roughly halved:

  • Layers: 32
  • Embedding dimensions: 4,544

Both versions were trained with bfloat16 precision and AdamW. They used AWS SageMaker with 384 A100 40GB GPUs in P4d instances but have not yet disclosed how long the training lasted.
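
As a toy illustration of what these two choices mean in practice (this is not TII’s training code, which has not been released), here is how bfloat16 autocast and AdamW combine in plain PyTorch, with a tiny linear layer standing in for the LLM:

import torch

model = torch.nn.Linear(10, 10)  # stand-in for a real LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

x = torch.randn(4, 10)
# Forward pass computed in bfloat16
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()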

The instruct versions of Falcon-40B and 7B perform even better.

Falcon-40B-Instruct was trained on AWS SageMaker, utilizing P4d instances equipped with 64 A100 40GB GPUs. For Falcon-7B-Instruct, they used only 32 A100 GPUs.

They were fine-tuned on 250 million tokens of a mixture of chat/instruct datasets sourced from Baize, GPT4All, and GPTeacher, plus 13 million tokens from the RefinedWeb corpus.

Baize is a dataset generated with ChatGPT. I would be cautious about using the instruct versions of the Falcon models in commercial applications. As per OpenAI’s terms of use:

“Restrictions. You may not […] (iii) use output from the Services to develop models that compete with OpenAI”

“Services” includes ChatGPT. And Falcon-40B is a model that can “compete” with OpenAI’s GPT models.

In a previous article, I introduced QLoRa to fine-tune LLMs on consumer hardware.

You can follow the same steps for Falcon-7B, but it won’t work on the free instance of Google Colab: the model requires too much CPU RAM.

If you have 32 GB of RAM in your computer, this should work. If you don’t have that much RAM, you will have to opt for cloud computing or Google Colab Pro, for instance.

Once you have an environment that can support Falcon-7B, there are still some minor modifications to perform to my QLoRa tutorial.

First, you must install “einops”:

pip install -q einops

Then, modify the loading of the model as follows:

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

In this line, “trust_remote_code=True” is necessary. This is how Hugging Face gets your consent to run code from the model repository directly on your machine. Here, Falcon uses its own configuration and modeling code.
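
For context, here is a minimal sketch of what the full loading step can look like with 4-bit quantization. The names model_id and bnb_config come from my QLoRa tutorial; the exact quantization settings below are illustrative, not the only possible choices:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-7b"

# 4-bit NF4 quantization with bfloat16 compute, as in the QLoRa setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)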

Other than that, everything else should work the same as in my tutorial.

If you don’t want to use QLoRa and have access to a GPU cluster, the standard way of loading and running Falcon-7B/Falcon-40B would be as described in the Hugging Face models’ cards:

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

The Falcon models are pre-trained LLMs. You can use them for any natural language processing task if you have the data to fine-tune them. Note that, even without fine-tuning, the standard (non-instruct) versions already perform very well on many tasks, as shown on the OpenLLM leaderboard, for instance for answering questions from various domains and for commonsense inference.

The “instruct” versions of the Falcon models are already fine-tuned. They behave like ChatGPT, i.e., a chatbot with general knowledge.

The Falcon models are also very interesting alternatives to the popular LLaMa model. Falcon-40B is:

  • Smaller: LLaMa is 65 billion parameters while Falcon-40B is only 40 billion parameters, so it requires less memory.
  • Better: On the OpenLLM leaderboard, Falcon-40B-Instruct is ranked first and the standard Falcon-40B second, ahead of LLaMa-65B.
  • Free: Falcon models are distributed under an Apache 2.0 license allowing commercial use while LLaMa can only be used for research purposes.

If you are interested in getting more information about these models, keep an eye on this blog post. TII will release a scientific paper/technical report describing in more detail what they did. I’ll drop the link here once it is online.


