Everything You Should Know About Evaluating Large Language Models | by Donato Riccio | Aug, 2023

Open Language Models

From perplexity to measuring general intelligence

Image generated by the author using Stable Diffusion.

As open source language models become more readily available, getting lost in all the options is easy.

How do we determine their performance and compare them? And how can we confidently say that one model is better than another?

This article provides some answers by presenting training and evaluation metrics, along with general and specific benchmarks, to give you a clear picture of your model’s performance.


Language models define a probability distribution over a vocabulary of words. Given a text, the model assigns a probability to every word in the vocabulary as a candidate for the next word in the sequence, and in the simplest case the most likely one is selected.
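To make this concrete, here is a minimal sketch of next-word selection. The four-word vocabulary and the raw scores are made up for illustration; a real model scores tens of thousands of tokens with a neural network, and `softmax` here is just a small helper written for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {word: math.exp(s - peak) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Hypothetical raw scores for the word that follows "The cat sat on the"
toy_scores = {"mat": 3.0, "sofa": 2.0, "moon": 0.5, "equation": -1.0}

# The probability distribution over the (toy) vocabulary
next_word_probs = softmax(toy_scores)

# Pick the single most likely next word
most_likely = max(next_word_probs, key=next_word_probs.get)

for word, prob in sorted(next_word_probs.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {prob:.3f}")
print("selected:", most_likely)
```

Always picking the top word is only one possible strategy; it is used here because it matches the description above.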

Perplexity measures how well a language model can predict the next word in a given sequence. As a training metric, it shows how well the model has learned its training set.

We won’t go into the mathematical details, but intuitively, minimizing perplexity means maximizing the probability the model assigns to the words that actually appear in the text.

In other words, the best model is the one that is not surprised when it sees new text, because it already assigned a high probability to the words that actually come next in the sequence.
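For readers who want to see the intuition in numbers, the sketch below computes perplexity as the exponential of the average negative log-probability the model gave to each observed token. The probability lists are invented for illustration and do not come from any real model.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-probability per token)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities a model assigned to the tokens that actually occurred
confident = [0.9, 0.8, 0.95, 0.85]   # the model expected this text
surprised = [0.2, 0.1, 0.3, 0.25]    # the model did not expect this text
uniform = [0.25, 0.25, 0.25, 0.25]   # a blind guess among 4 equally likely options

print("confident:", perplexity(confident))
print("surprised:", perplexity(surprised))
print("uniform:  ", perplexity(uniform))
```

The uniform case is a handy sanity check: guessing blindly among four equally likely options gives a perplexity of 4, so perplexity can be read as the effective number of choices the model is hesitating between.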

While perplexity is helpful, it doesn’t consider the meaning behind the words or the context in which they are used. It is also influenced by how we tokenize our data: language models with different vocabularies and tokenization techniques can produce different perplexity scores for the same text, making direct comparisons less meaningful.
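The tokenization effect can be shown with a small sketch. Assume two hypothetical models assign exactly the same total log-probability to a sentence but split it into a different number of tokens. All numbers below are invented, and `normalized_perplexity` is a helper defined only for this example.

```python
import math

def normalized_perplexity(total_log_prob, n_units):
    """Perplexity when the total log-probability is averaged over n_units."""
    return math.exp(-total_log_prob / n_units)

# Assumption: both models give the whole sentence the same log-probability
total_log_prob = -24.0   # natural log, hypothetical value

n_words = 8              # words in the sentence
tokens_model_a = 10      # model A's tokenizer yields 10 tokens
tokens_model_b = 16      # model B's tokenizer yields 16 tokens

ppl_a = normalized_perplexity(total_log_prob, tokens_model_a)
ppl_b = normalized_perplexity(total_log_prob, tokens_model_b)
ppl_word = normalized_perplexity(total_log_prob, n_words)

print("per-token perplexity, model A:", ppl_a)
print("per-token perplexity, model B:", ppl_b)
print("per-word perplexity, either model:", ppl_word)
```

Model B looks better per token only because its tokenizer cuts the text into more pieces. Averaging over a unit both models share, such as words or characters, is one common way to make the numbers more comparable.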

Perplexity is a useful but limited metric. We use it primarily to track progress during a model’s training or to compare…
