In this article, we will see how language models (LMs) can achieve LLM-like results (sometimes even better ones) by focusing on better data and training strategies rather than sheer size, and how researchers are already doing this successfully while keeping access democratic.
Large Language Models (LLMs) have evolved significantly. They bring remarkable capabilities, from generating human-like text to understanding intricate contexts. While much of the initial excitement revolved around models with a massive number of parameters, recent developments suggest that size isn't the only thing that matters. Lately, a new concept called Small Language Models (SLMs) has rightfully gained traction as a motivation to develop language models more intelligently.
As LLMs entered the stage, the narrative was straightforward: bigger is better. Models with more parameters are expected to understand context better, make fewer mistakes, and provide better answers. But as the models grew, so did their hunger for computational resources. Training these behemoths became an expensive task, one that not everyone is willing (or able) to pay for.
Recognizing the unsustainability and diminishing returns of simply adding more parameters, researchers began to rethink their strategies. Instead of merely throwing dollars into the cloud fire (adding yet another billion parameters), some shifted to better data and more efficient training strategies. The idea is elegant: a well-trained smaller model might outperform a poorly trained larger model. But can it?
Chinchilla and the Optimal Point for LLM Training
The “Chinchilla paper”, a significant contribution to the field, offers intriguing insights into LLM training. Its experiments indicate that there is an “optimal point” when training LLMs: beyond it, pouring more resources into training in the form of additional parameters does not result in a proportional increase in performance. The paper emphasizes that a model's performance is not defined by its size alone, but also by the quality and quantity of the data it is trained on. The authors found that for compute-optimal training, model size and the number of training tokens should be scaled equally: for every doubling of the model size, the number of training tokens should also be doubled.
The authors test this by training Chinchilla, a 70-billion-parameter model trained on 1.4 trillion tokens. Despite being four times smaller, Chinchilla outperforms Gopher (a 280-billion-parameter model) on almost all evaluations, including language modeling, question answering, and common-sense tasks.
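The equal-scaling rule can be sketched numerically. This is a rough illustration, not the paper's exact fitting procedure: it assumes the common approximation that training cost is C ≈ 6·N·D FLOPs (N parameters, D tokens) and uses the roughly 20-tokens-per-parameter ratio implied by Chinchilla's own configuration (70B parameters, 1.4T tokens).

```python
import math

# Assumed ratio implied by Chinchilla's configuration: 1.4e12 / 70e9 = 20.
TOKENS_PER_PARAM = 20

def compute_optimal(c_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (params, tokens), assuming C = 6*N*D
    and the fixed ratio D = TOKENS_PER_PARAM * N."""
    n = math.sqrt(c_flops / (6 * TOKENS_PER_PARAM))
    d = TOKENS_PER_PARAM * n
    return n, d

# Recover Chinchilla's shape from its approximate budget:
# C = 6 * 70e9 * 1.4e12 ≈ 5.9e23 FLOPs.
c = 6 * 70e9 * 1.4e12
n, d = compute_optimal(c)
print(f"params ≈ {n:.2e}, tokens ≈ {d:.2e}")  # ≈ 7.00e10 params, 1.40e12 tokens
```

Note how doubling `c_flops` by 4x doubles both `n` and `d`, which is exactly the "scale model size and tokens equally" prescription.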
Even with its reduced size, Chinchilla performs better than its SOTA counterparts on a variety of tasks:
Reading comprehension and automated reasoning are standard tasks a language model is typically tested on. They probe the model's ability to understand the broader context of a text: for example, predicting a word that can only be guessed if the model understands its relation to context that came before it (sometimes far from the word's position). These abilities are usually evaluated using benchmarks and datasets such as RACE-h, RACE-m, and LAMBADA. Chinchilla outperforms much bigger models even on these hard-to-define, hard-to-test tasks.
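To make the LAMBADA-style setup concrete, here is a minimal toy sketch of the evaluation loop: the model must predict the final word of a passage, which is only guessable from the broader context. The `predict_last_word` function and the two example passages are invented stand-ins for a real language model and the real dataset.

```python
def predict_last_word(context: str) -> str:
    """Toy stand-in for a real LM: a real model would condition on the
    entire passage; this one just keys off a single clue word."""
    return "keys" if "unlock" in context else "unknown"

# Hypothetical (context, target-word) pairs in the LAMBADA format.
examples = [
    ("She searched every pocket for something to unlock the door: her", "keys"),
    ("He stared at the blank page, unsure what to", "write"),
]

# Accuracy = fraction of passages where the predicted last word matches.
correct = sum(predict_last_word(ctx) == target for ctx, target in examples)
accuracy = correct / len(examples)
print(f"accuracy: {accuracy:.2f}")  # 0.50 for this toy model
```

The toy model gets the first example right (the clue is nearby) and the second wrong, which is precisely why the benchmark is hard: the target is often determined only by distant context.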
And Chinchilla is one of many LMs showing promising results despite not focusing on augmenting size.
LLaMA goes even further. The authors introduce smaller foundation language models ranging from 7B to 65B parameters, trained on over 1 trillion tokens using only publicly available data, which makes them compatible with open sourcing.
LLaMA-13B outperforms the much larger 175B-parameter GPT-3 on most benchmarks while being over 10x smaller. The authors argue that, for a given compute budget and target performance level, smaller models trained longer are preferable to larger models because of their better inference efficiency.
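The inference-efficiency argument is easy to quantify with a back-of-the-envelope rule of thumb (an assumption on my part, not a figure from the paper): a dense decoder-only transformer spends roughly 2·N FLOPs per generated token, so serving cost scales with parameter count.

```python
def inference_flops_per_token(n_params: float) -> float:
    """Rough rule of thumb for dense decoder-only transformers:
    ~2 FLOPs per parameter per generated token."""
    return 2 * n_params

# Relative serving cost of GPT-3 (175B) vs LLaMA-13B, per token.
ratio = inference_flops_per_token(175e9) / inference_flops_per_token(13e9)
print(f"GPT-3 costs ~{ratio:.1f}x more per generated token than LLaMA-13B")
```

Under this estimate every token out of GPT-3 costs roughly 13x what it costs LLaMA-13B, so even if the smaller model was more expensive to train per parameter, it amortizes that cost over every deployment.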
Some projects have even managed to run LLaMA (or rather a version of it) on budget Android smartphones, further proof that we are on the right path to democratizing access to performant LMs with low computing resources (LLaMA.c).
LLaMA-65B (I know, not that small anymore, but still…) is competitive with current state-of-the-art models like PaLM-540B, which use proprietary datasets. This clearly shows how good data not only improves a model's performance but can also make it democratic: a machine learning engineer no longer needs an enormous budget to train a good model on a good dataset.
Good data trumps the Goliath
Further reinforcing the thesis that LMs don't need to be gigantic to perform well, TinyStories presents a synthetic dataset of stories containing only words that small children (up to four years old) can understand. It can be used to train small language models (SLMs) with under 10 million parameters that generate multi-paragraph stories with good grammar, reasoning, and coherence. This contrasts with previous work in which 125M+ parameter models, such as GPT-Neo (small) and GPT-2 (small), struggled to produce coherent text.
One of the most exciting aspects of TinyStories is that the dataset itself was created by GPT-3.5 and GPT-4. The authors also introduce a new SLM evaluation paradigm that uses GPT-4 to “grade” generated stories on dimensions like grammar, plot, and creativity, overcoming the limitations of standard benchmarks that require constrained outputs.
The journey of LMs showcases a pivotal lesson in AI: Bigger is not always better. As the community continues to evolve and innovate, there’s a realization that efficiency, quality of data, and optimized training strategies hold the key to the future of machine learning.
- Chinchilla shows that there is an optimal point when training LMs with respect to the number of training tokens and the quality of the training data, and that this is as important as (or more important than) the number of parameters in the model;
- LLaMA shows that Chinchilla-like results are achievable using only publicly available data, proving the strategy is democratically accessible;
- Datasets like TinyStories can be used to train small language models (under 10 million parameters) that outperform billion-parameter models on specific tasks.
Hoffmann, Jordan, et al. “Training Compute-Optimal Large Language Models.” arXiv preprint arXiv:2203.15556 (2022).
Hendrycks, Dan, et al. “Measuring Massive Multitask Language Understanding.” arXiv preprint arXiv:2009.03300 (2020).
Steinhardt, Jacob. “Updates and Lessons from AI Forecasting.” 2021. URL: https://bounded-regret.ghost.io/ai-forecasting/.
Lai, Guokun, et al. “RACE: Large-scale ReAding Comprehension Dataset From Examinations.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.
Paperno, Denis, et al. “The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context.” arXiv preprint arXiv:1606.06031 (2016).
Touvron, Hugo, et al. “LLaMA: Open and Efficient Foundation Language Models.” arXiv preprint arXiv:2302.13971 (2023).
Eldan, Ronen, and Yuanzhi Li. “TinyStories: How Small Can Language Models Be and Still Speak Coherent English?” arXiv preprint arXiv:2305.07759 (2023).