Training Language Models with Textbook-Quality Synthetic Data | by Vincent Vatter | Jun, 2023


The utility of synthetic data — data generated by models themselves — has been a topic of much debate. Attempts to train smaller models on the output of larger models, such as in the creation of Alpaca and Vicuna, have met with skepticism. Critics often point to arguments such as those in the Berkeley paper The False Promise of Imitating Proprietary LLMs, which states that “model imitation is a false promise: there exists a substantial capabilities gap between open and closed LMs that, with current methods, can only be bridged using an unwieldy amount of imitation data or by using more capable base LMs”.

However, Textbooks Are All You Need challenges this perspective, demonstrating that the output of larger models can be utilized for purposes beyond mere imitation. Remarkably, the paper’s small model even manages to outperform the large model that generated the synthetic data it was trained on. This observation prompts a tantalizing question: Could the performance of large models be enhanced by training them on their own output?

Before delving into the training data, let’s glance at the results the models achieve. The three models in the paper are phi-1-base, phi-1, and phi-1-small. Notably, these models aren’t just compact in terms of parameters; they’re also trained on limited data. Given this, their performance is nothing short of astonishing.

+---------------------+--------------+--------------+-----------+
| Model               | Model size   | Dataset size | HumanEval |
|                     | (parameters) | (tokens)     | (pass@1)  |
+---------------------+--------------+--------------+-----------+
| GPT-4               | ?            | ?            | 67.0%     |
| WizardCoder         | 16.0B        | 1000B        | 57.3%     |
| phi-1 (finetuned)   | 1.3B         | 7B           | 50.6%     |
| GPT-3.5             |              |              |           |
+---------------------+--------------+--------------+-----------+
Evaluation of selected models on the HumanEval benchmark. Source: adapted from Textbooks Are All You Need.

The scores here are on OpenAI’s HumanEval benchmark, introduced in their paper Evaluating Large Language Models Trained on Code. In each problem of this benchmark, the model is given a function signature and docstring and asked to write the body of the function. To illustrate, consider the following example drawn from the HumanEval paper, where the model is given this signature and docstring.

def incr_list(l: list):
    """Return list with elements incremented by 1.
    >>> incr_list([1, 2, 3])
    [2, 3, 4]
    >>> incr_list([5, 3, 5, 2, 3, 3, 9, 0, 123])
    [6, 4, 6, 3, 4, 4, 10, 1, 124]
    """
Source: Evaluating Large Language Models Trained on Code.

For this problem, we hope the model would generate something like this:

return [i + 1 for i in l]
Source: Evaluating Large Language Models Trained on Code.

However, the model is not evaluated on producing this exact string (which would require it to solve the problem in the same way, and with the same variable names, as the reference solution). Instead, whatever body the model produces is run against several unit tests (on average, 7.7 per problem, each consisting of a choice of input parameters and the expected output that the generated code needs to match). The code is deemed correct if it passes all of the unit tests. The pass@1 metric in the table above is simply the percentage of generated function bodies that pass all of the unit tests. The more general pass@k metrics allow a model to generate k samples, and count a problem as solved if any one of those samples passes all of the unit tests.
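The HumanEval paper also describes an unbiased way to estimate pass@k when n ≥ k samples are generated per problem and c of them pass the tests. Below is a minimal sketch along the lines of that estimator; the function and variable names are my own.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of k
    samples drawn from n generated samples, c of which are correct,
    passes all of the unit tests."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed in a numerically stable way.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# For example, if 200 samples are generated and 30 pass, pass@1 is estimated at 0.15.
print(pass_at_k(200, 30, 1))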

The models in the paper were trained on data from three sources. The first, The Stack+, is a 35B-token, deduplicated version of The Stack together with code from StackOverflow, restricted to Python. However, it’s important to note that phi-1 and its variants are not trained directly on this source. Instead, these models are trained on CodeTextbook, a textbook-quality dataset consisting of a 6B-token filtered selection from The Stack+ plus a 1B-token synthetic component, and on CodeExercises, a 180M-token synthetic set of exercises and solutions mirroring the problem style of the HumanEval benchmark. The effects of training on these different sources are shown in the figure below.

HumanEval results after training on various sources. Image from Textbooks Are All You Need.

Here we see 9 models of varying sizes trained on varying subsets of this data. The models shown in light green are trained only on CodeTextbook and not on The Stack+, and their stronger results make it evident that CodeTextbook is the better source. The fine-tuning on CodeExercises that the models in dark green received makes an even bigger difference.

Three of the models in the chart are named:

  • phi-1-base is a 1.3B-parameter model (pre)trained with “about 8 passes” over the 7B tokens of CodeTextbook. This amounts to about 50B tokens of training data, and took 4 days on 8 A100s.
  • phi-1 is the result of fine-tuning phi-1-base on the 180M tokens of CodeExercises. This fine-tuning took 7 hours on 8 A100s.
  • phi-1-small is made using a similar process to phi-1, but with a 350M-parameter model design and apparently about 11 passes over CodeTextbook. It takes about 2 days to train on 8 A100s.

For the filtered part of CodeTextbook, the authors started with a 35B-token, deduplicated, Python-restricted copy of The Stack together with code from StackOverflow, referred to as Stack+ in the chart above. They then filtered this down to a 6B-token, textbook-quality subset.

To do this filtering, GPT-4 is first used to determine the educational value of about 0.3% of the entire 35B-token dataset (100M tokens). The prompt used is “determine its educational value for a student whose goal is to learn basic coding concepts”.

It’s not explicitly stated why GPT-4 was chosen over GPT-3.5 for this step, since GPT-3.5 is used for all other stages of the process. However, considering the task is classifying “only” 100M tokens, the use of GPT-4 is not overly expensive and will certainly yield more accurate results.

Next, these annotations are used to train another model (a random forest classifier) to classify the rest of the dataset as high or low educational value. Subsequently, this classifier is used to filter the original dataset to a 6B-token dataset of high educational quality.
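A minimal sketch of this step, assuming each code file has already been represented by an embedding from some pretrained code model and that GPT-4’s judgments have been reduced to 0/1 labels (the file names and shapes here are hypothetical, not details from the paper):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical inputs: one embedding vector per GPT-4-annotated file, plus a
# 0/1 label distilled from GPT-4's judgment of the file's educational value.
annotated_embeddings = np.load("annotated_embeddings.npy")  # shape (n_files, dim)
labels = np.load("gpt4_educational_labels.npy")             # shape (n_files,)

# Train a random forest on the annotated subset.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(annotated_embeddings, labels)

# Apply the classifier to the rest of The Stack+ and keep only the files
# predicted to have high educational value.
all_embeddings = np.load("all_embeddings.npy")              # shape (N_files, dim)
keep = clf.predict(all_embeddings) == 1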

This is where things get more interesting: for the 1B-token synthetic component of CodeTextbook, the authors use GPT-3.5 to generate high-quality “Python textbooks”.

There is some precedent for using LLMs to generate synthetic data used to train smaller models. In an earlier Microsoft Research paper, TinyStories: How Small Can Language Models Be and Still Speak Coherent English?, the goal is to train small language models (1M to 33M parameters) to write intelligible stories at the level of toddlers, and the dataset consists entirely of stories written by GPT-3.5 and GPT-4. Quoting from the TinyStories paper:

“The main challenge in using large language models for producing training data is generating a dataset that is sufficiently diverse: prompting those models to produce stories, even if the temperature of generation is set to a high value, will still produce a very repetitive dataset, whose diversity is very far from what is required for training a language model that has a comparable “understanding” of language to that of children.”

The trick TinyStories uses to diversify synthetic data is to choose three random words (a noun, a verb, and an adjective) and a small number of “story features” for each prompt. For example, one of their prompts is the following.

Write a short story (3–5 paragraphs) which only uses very simple words that a 3 year old child would likely understand. The story should use the verb “decorate”, the noun “thunder” and the adjective “ancient”. The story should have the following features: the story should contain at least one dialogue, the story has a bad ending. Remember to only use simple words!
Source: TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
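To make this diversification trick concrete, here is a rough sketch of how such prompts could be assembled; the word lists, feature list, and wording are illustrative placeholders, not the ones used in TinyStories.

import random

# Illustrative word lists; TinyStories restricted itself to words a typical
# 3-to-4-year-old would understand.
NOUNS = ["thunder", "garden", "pencil", "river"]
VERBS = ["decorate", "jump", "whisper", "share"]
ADJECTIVES = ["ancient", "shiny", "quiet", "brave"]
FEATURES = [
    "the story should contain at least one dialogue",
    "the story has a bad ending",
    "the story has a moral",
]

def make_prompt() -> str:
    """Assemble one diversified story-generation prompt."""
    noun = random.choice(NOUNS)
    verb = random.choice(VERBS)
    adjective = random.choice(ADJECTIVES)
    features = ", ".join(random.sample(FEATURES, k=2))
    return (
        "Write a short story (3-5 paragraphs) which only uses very simple words "
        "that a 3 year old child would likely understand. "
        f'The story should use the verb "{verb}", the noun "{noun}" and the '
        f'adjective "{adjective}". The story should have the following features: '
        f"{features}. Remember to only use simple words!"
    )

print(make_prompt())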

Unfortunately, the authors of Textbooks Are All You Need don’t give us nearly as many details about their trick for generating a diverse collection of textbook-quality text, and the project does not appear to have released any code or data for us to investigate. They do say that they target content covering “topics that prompt reasoning and basic algorithmic skills”, and that they impose constraints on the topics and on the audience of the textbook. Below is their example of a typical response to one of their prompts, quoted from the paper.

To begin, let us define singular and nonsingular matrices. A matrix is said to be singular if its determinant is zero. On the other hand, a matrix is said to be nonsingular if its determinant is not zero. Now, let’s explore these concepts through examples…
Source: Textbooks Are All You Need.

Needless to say, it would be interesting to know a lot more about this step of the process. What are the specific prompts? How are the topics chosen? What audience or audiences is GPT-3.5 told to write for? It would also be interesting to inspect CodeTextbook directly, but the data has not been released.

The final piece of the training data for phi-1 and phi-1-small (though not for phi-1-base) is CodeExercises, a set of exercises and solutions that mirror the format of the HumanEval benchmark problems. Once again, this data is entirely synthetic and produced by GPT-3.5. The authors say that diversity in the outputs was achieved by constraining the function names. While the exact meaning of this is not clear to me, it might entail first generating a list of function names and signatures, and then prompting GPT-3.5 to produce the corresponding docstring and body. The authors provide an example of a typical output, quoted below.

def valid_guessing_letters(word: str, guesses: List[str]) -> List[str]:
    """
    Returns a list of valid guessing letters, which are letters that have not
    been guessed yet and are present in the word.

    Parameters:
    word (str): The word to guess.
    guesses (List[str]): A list of letters that have already been guessed.

    Returns:
    List[str]: A list of valid guessing letters.
    """
    valid_letters = []
    for letter in word:
        if letter not in guesses…
Source: Textbooks Are All You Need.
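If diversity really is driven by fixing the function names in advance, the orchestration might look something like the sketch below. The names, prompt wording, and workflow are my guesses rather than details from the paper, and the actual call to GPT-3.5 is omitted; the sketch only prints the prompts it would send.

# Hypothetical function names used to force diversity; the paper does not say
# how the names themselves are produced.
function_names = [
    "valid_guessing_letters",
    "merge_sorted_intervals",
    "rotate_matrix_clockwise",
]

def build_exercise_prompt(name: str) -> str:
    """Build a prompt asking for a docstring and solution for a fixed function name."""
    return (
        "Write a self-contained Python exercise in the style of a docstring plus "
        f"solution. The function must be named {name}. Give it a clear signature, "
        "a descriptive docstring with parameters and return value, and a correct "
        "implementation."
    )

for name in function_names:
    print(build_exercise_prompt(name))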

The authors refer to this dataset as small because it contains only 180M tokens. However, if the example above is representative (a couple of hundred tokens per exercise and solution), then CodeExercises contains on the order of one million exercises and solutions.

It’s fair to be suspicious that CodeExercises is simply stumbling onto the same functions as are in the HumanEval benchmark, leading to phi-1 being fine-tuned on solutions to the very exercises it is tested on. The authors devote considerable space (all of Section 5) to arguing against this concern. They first contend that there is limited similarity between CodeExercises and HumanEval. Secondly, they argue that even when exercises in CodeExercises that bear a slight resemblance to those in HumanEval are pruned (where resemblance is measured in terms of embedding distance), models trained on the pruned datasets remain impressive.
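The pruning step is easy to picture. Below is a minimal sketch of embedding-distance pruning, assuming the exercises and the HumanEval problems have already been embedded by some model; the similarity threshold is illustrative, not the one used in the paper.

import numpy as np

def prune_similar(exercise_embs: np.ndarray,
                  humaneval_embs: np.ndarray,
                  threshold: float = 0.9) -> np.ndarray:
    """Return a boolean mask over exercises, keeping only those whose maximum
    cosine similarity to any HumanEval problem is below the threshold."""
    # Normalize rows so that dot products equal cosine similarities.
    ex = exercise_embs / np.linalg.norm(exercise_embs, axis=1, keepdims=True)
    he = humaneval_embs / np.linalg.norm(humaneval_embs, axis=1, keepdims=True)
    max_sim = (ex @ he.T).max(axis=1)
    return max_sim < threshold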

The focus of the paper, and of this deep dive into the paper, has been on data quality. However, it’s enlightening to consider what it would cost to duplicate the experiment today, at least to consider the relative costs of its individual components.

  • Filtering. The process of filtering The Stack+ involved using GPT-4 to determine the educational value of 100,000 files, or about 100M input tokens. Ignoring the output tokens (which would be minimal) and using today’s price of $0.03 / 1K input tokens, this would cost about $3,000.
  • Synthesizing. CodeTextbook and CodeExercises together contain about 1280M tokens of GPT-3.5-generated text. At today’s price of $0.002 / 1K output tokens, creating this data would cost a little over $2,500.
  • Training. The phi-1 model was trained for about 1,090 GPU-hours. At today’s price of about $1/hour for an A100, this would amount to about $1,000. The 350M-parameter phi-1-small could be trained for about $400.

Approximately $6,500 of compute went into the creation of phi-1.
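For transparency, here is the back-of-the-envelope arithmetic behind those figures, using the prices quoted above (all values approximate).

# Approximate costs, in dollars, at mid-2023 prices.
filtering = (100_000_000 / 1_000) * 0.03      # 100M input tokens to GPT-4 at $0.03/1K
synthesis = (1_280_000_000 / 1_000) * 0.002   # 1280M output tokens from GPT-3.5 at $0.002/1K
training = 1_090 * 1.00                       # ~1,090 A100-hours at ~$1/hour

print(filtering, synthesis, training)         # 3000.0 2560.0 1090.0
print(filtering + synthesis + training)       # roughly 6,650 in total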

The authors speculate that using GPT-4 for the synthesizing would be a lot better: “we also believe that significant gains could be achieved by using GPT-4 to generate the synthetic data instead of GPT-3.5, as we noticed that GPT-3.5 data has a high error rate.” But these costs show why they didn’t: at roughly 30 times the price of GPT-3.5, it would cost about $75,000 to generate the synthetic portions of CodeTextbook and CodeExercises with GPT-4.

The results from Textbooks Are All You Need are very impressive, especially given the smaller size of the models and the limited training data they were given. This paper is one more piece of evidence that data quality can make up for data quantity and model size.

The discussion around synthetic data will undoubtedly persist. The concept is appealing: if we don’t have high-quality data readily available, could we just synthesize it? Textbooks Are All You Need teases some promising possibilities in this area. Still, it’s not the perfect experiment we might dream of, given that only about 1B of the 7B tokens in CodeTextbook were synthetically created. But it’s worth pointing out that the other 6B tokens were selected by a classifier trained on GPT-4’s synthetic annotations.

Training on entirely synthetic data has shown some exciting results in computer vision as well. The Google Research study StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners trains visual representation models entirely on synthetic images produced by Stable Diffusion, and the representations it reports match or surpass those learned from real images.

A similar approach was taken in the TinyStories paper, which relied only on synthetic data for training, but the models it trained were very small. What if larger language models were trained in the same way? The potential this presents is exciting, and it will no doubt be the focus of numerous studies in the future.

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. (2021). Evaluating large language models trained on code. arXiv:2107.03374.

Eldan, R. and Li, Y. (2023). TinyStories: How small can language models be and still speak coherent English? arXiv:2305.07759.

Gudibande, A., Wallace, E., Snell, C., Geng, X., Liu, H., Abbeel, P., Levine, S., and Song, D. (2023). The false promise of imitating proprietary LLMs. arXiv:2305.15717.

Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Giorno, A. D., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. (2023). Textbooks are all you need. arXiv:2306.11644.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. (2022). Training compute-optimal large language models. arXiv:2203.15556.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models. arXiv:2001.08361.

Tian, Y., Fan, L., Isola, P., Chang, H., and Krishnan, D. (2023). StableRep: Synthetic images from text-to-image models make strong visual representation learners. arXiv:2306.00984.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L., and Levy, O. (2023). LIMA: Less is more for alignment. arXiv:2305.11206.


