How benchmark leakage and data contamination undermine LLM evaluation

Image by author. (AI-assisted)

“Our new LLM beats GPT in every benchmark!”

Bold claims like this are becoming increasingly common as the hype around LLMs grows. New models appear every week, and everyone is currently trying to compete with GPT-4, which many still consider the most capable LLM.

Benchmarking is a critical part of evaluating progress in large language models.

Benchmarks like MMLU and HellaSwag are the standard for assessing language models on skills like reasoning and comprehension. The scores provide a snapshot of progress, with new state-of-the-art results heralded as breakthroughs. LLMs are usually evaluated in a zero-shot or few-shot setting, without explicit training on the test set, to gauge their general abilities.

This article shows how easy it is to manipulate benchmark results and offers suggestions to maintain evaluation integrity.

The Trouble with Benchmarks

Often, benchmarks don’t reflect usefulness in real-life scenarios. Google’s newest model, Gemini Ultra, scores 90.04% on MMLU. This is an impressive score, but a closer look at the evaluation methodology shows that it was obtained with CoT@32 (chain of thought with 32 samples). That means the model is sampled 32 times per question, and the final answer is chosen from those samples, to reach 90% accuracy! Most of us expect an accurate answer on the first try, especially when interacting with a chatbot.

Google Gemini technical report. [1]
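To make the idea behind CoT@32 more concrete, here is a minimal sketch of majority voting over several sampled answers. It is an illustration only, not the procedure from the Gemini report, which describes its own variant. The functions `majority_vote` and `simulate_samples` are hypothetical, and `simulate_samples` merely stands in for calling a real model; the 60% per-sample accuracy is an arbitrary assumption.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the share of samples that agree with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def simulate_samples(correct, choices, p_correct, k, rng):
    """Hypothetical stand-in for sampling k chain-of-thought answers from a model.

    Each sample is correct with probability p_correct, otherwise a random wrong choice.
    """
    wrong = [c for c in choices if c != correct]
    return [correct if rng.random() < p_correct else rng.choice(wrong) for _ in range(k)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumption: a model that answers a four-choice question correctly 60% of the time.
    samples = simulate_samples("B", ["A", "B", "C", "D"], p_correct=0.6, k=32, rng=rng)
    final_answer, agreement = majority_vote(samples)
    print(f"first sample: {samples[0]}")
    print(f"voted answer over {len(samples)} samples: {final_answer} ({agreement:.0%} agreement)")
```

The point of the sketch is the gap between a single sample and the aggregated answer: a model that is only moderately reliable per sample can look much stronger once 32 samples are combined, which is not what a chatbot user experiences.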

Unfortunately, this issue is just the tip of the iceberg in LLM evaluation.

In machine learning, models are commonly evaluated by measuring their performance on a test set that was not used during training. Typically, this process allows for an unbiased estimate of how the model will generalize to new data.

Benchmark leakage and data contamination are two terms for the same concerning issue: test data somehow leaks into the pretraining data of LLMs, leading to inflated performance. It makes comparisons between LLMs unfair and…
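One simple way to look for this kind of leakage is to check whether long word sequences from a benchmark item appear verbatim in a training document. The sketch below is a rough illustration under stated assumptions, not a standard tool: the helper names `ngrams` and `overlap_ratio` are hypothetical, whitespace splitting stands in for a real tokenizer, and the window of 8 words is an arbitrary choice.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, corpus_doc, n=8):
    """Fraction of the benchmark item's n-grams that also occur in the corpus document.

    Returns 0.0 when the item is shorter than n words.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


if __name__ == "__main__":
    # Made-up strings for illustration only.
    question = "the quick brown fox jumps over the lazy dog near the river bank today"
    leaked_page = "scraped page: " + question + " end of page"
    clean_page = "an unrelated article about cooking pasta with tomato sauce and basil leaves at home"
    print(overlap_ratio(question, leaked_page))
    print(overlap_ratio(question, clean_page))
```

A high ratio suggests the item was copied into the corpus, but a low ratio proves little: paraphrased or translated test items would pass this check unnoticed.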

