LLMs (Large Language Models) and generative AI are all the rage right now. A staggering statistic from IBM reveals that nearly 2 in 3 C-Suite executives feel pressure from investors to accelerate their adoption of generative AI. Naturally, this pressure is trickling down to Data Science and Machine Learning teams, who are responsible for navigating the hype and creating winning implementations.
As the landscape evolves, the LLM ecosystem has split between open-source and proprietary models, and the moat between them is closing quickly. This emerging scene has prompted many teams to ask the following question: How can we make an LLM more specific to our use case?
In this article we explore some key considerations that should be top of mind when contemplating the investment of time and engineering cycles to build a niche LLM. On this journey, it is crucial to be aware of recent research on the potential limitations of, and best practices for, building fine-tuned language models. After reading this article, you’ll be equipped with a few more ideas to help your organization decide whether to train, and how to train.
It’s no secret to anyone that OpenAI is leading the LLM charge with its latest iterations of GPT. For that reason, many stakeholders may ask a development team to deploy a model that imitates the results of the more robust model for various reasons (rate limits, data privacy, costs, etc.). This naturally leads developers to wonder: Can we generate outputs from GPT and use them to fine-tune a model?
The answer to this question remains uncertain, as it seems to depend on several factors. This particular task, known as imitation learning, involves training a new language model through fine-tuning using target observations from a more advanced model such as GPT. While this seems like a great way to get good performance out of a downstream model, it does come with its share of potential issues.
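To make the setup concrete, here is a minimal sketch of how an imitation dataset might be assembled. The `query_teacher` callable is hypothetical, standing in for whatever wrapper you use around the stronger model’s API, and the JSONL prompt/completion layout is just one common fine-tuning format.

```python
import json

def build_imitation_dataset(prompts, query_teacher):
    """Collect (prompt, completion) pairs from a teacher model.

    query_teacher is a hypothetical callable wrapping the stronger
    model's API; it takes a prompt string and returns a completion.
    """
    records = []
    for prompt in prompts:
        completion = query_teacher(prompt)
        records.append({"prompt": prompt, "completion": completion})
    return records

def to_jsonl(records, path):
    """Write records in the JSONL format commonly used for fine-tuning."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

The resulting file would then be fed to whatever fine-tuning pipeline your smaller model uses.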
A recent paper titled “The False Promise of Imitating Proprietary LLMs” sheds some light on potential pitfalls you may encounter with this approach. The authors present experiments demonstrating that adding more imitation data can actually degrade model performance. Looking at the center graph of the figure above, we can see that accuracy on the benchmark task decreases as the number of tokens increases. But why is that the case?
The authors suggest this happens because imitation models learn the style of the model they are mimicking rather than learning and understanding its content. Taking a look at the left pane of the figure above, the human reviewers favored the results of the imitation model over those of ChatGPT. Upon closer inspection, it was clear that the reviewers enjoyed the style of the imitation model but did not closely examine the content. The content produced by the imitation model tended to have weak factuality, leading the authors to conclude that “imitation models actually embody some of the worst aspects of AI assistants: their answers sound confident but are less factual than ChatGPT.”
It’s important to note that there are some scenarios where imitation models can achieve great performance. The authors point out that the imitation models can achieve good performance on local tasks, or tasks that replicate a very specific behavior of the teacher model. On a task created for the study called NQ-Synthetic, the authors task the language model with generating 10 questions and answers related to a given context. Remarkably, the imitation model achieved a score close to that of GPT. This suggests that more specific models could achieve favorable outcomes when attempting to imitate behaviors from a teacher model.
A fascinating corollary from the paper is that fine-tuning a model using a teacher model could actually help reduce the toxicity score of the imitation model. This could be extremely useful for companies that want to expose an open source LLM quickly without undergoing the laborious task of building filters surrounding the outputs. Instead of manually trying to build filters, companies could instead train on outputs from a carefully curated set of data from a teacher model to get a solid starting point.
It is worth mentioning the recent release of Orca, a model developed by Microsoft Research that incorporates signals from GPT as part of its training data. The difference here is the size of the training data: Orca is fine-tuned on 5 million examples, whereas the imitation model for broad coverage was tuned on approximately 151 thousand observations. Since I presume most of my audience will not be spending $16,000 to train an LLM as a casual experiment, I am inclined to make statements that refer more closely to the imitation modeling paper than to Orca. That being said, we will have to wait for more research on the minimum number of examples required for imitation learning to emerge as a viable option for broader tasks.
Takeaway: Depending on the complexity of your task, attempting to imitate the outputs of GPT or any sophisticated model with a weaker model may result in poor model performance.
In-Context Learning, or Few-Shot Learning, is the process of including task-specific examples in the prompt. This approach is specific to sophisticated language models, since open-source models have yet to achieve the flexibility needed to handle In-Context Learning. Usually it is possible to achieve great results with this approach, but have you ever wondered why this is the case?
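To make the idea concrete, here is a minimal sketch of a few-shot prompt for a hypothetical sentiment task. The “Review/Sentiment” template and field names are illustrative choices, not a prescribed format.

```python
def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from (input, output) example pairs."""
    parts = []
    for review, sentiment in examples:
        parts.append(f"Review: {review}\nSentiment: {sentiment}")
    # The unanswered query goes last; the model completes the sentiment
    parts.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(parts)
```

The assembled string is sent as a single prompt; the model infers the task pattern from the solved examples and completes the final line.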
The answer to this question is explored in a paper by Dai et al., where they examine the mathematical connections between loading examples in the prompt and fine-tuning using the same examples. The authors demonstrate that the prompt examples produce meta-gradients that are reflected during forward propagation at inference time. In the case of fine-tuning, the examples produce real gradients that are used to update the weights. Therefore, it appears that in-context learning achieves results similar to fine-tuning. For a more in-depth understanding of these findings, I would encourage reading the paper, which spares no detail in the mathematical connections.
Although the approach of In-Context Learning is great, there does exist a limitation that is not evident in fine-tuning. In the case we have a large corpus of training data, a fine-tuned model will make use of all of that data by updating the model with real gradients during training. During In-Context Learning we can only provide a limited number of observations. So here a question arises: Given a substantial training corpus, how can we make use of the most relevant examples given our input to achieve the best results?
One approach to tackle this issue is to select examples using a heuristic, and fortunately, LangChain provides support for this. LangChain is a Python module that houses pre-built prompts and utilities that simplify working with language models. The tool from LangChain we will concern ourselves with right now is the ExampleSelector.
from typing import List, Union

def get_similarity(seq_a: str, seq_b: str) -> Union[float, int]:
    """Similarity heuristic; here we use Jaccard similarity, or IOU.

    seq_a: First sequence to compare
    seq_b: Second sequence to compare

    Returns: Similarity score (float or int)
    """
    set_a = set(seq_a.split(' '))
    set_b = set(seq_b.split(' '))
    # Calculate IOU/Jaccard similarity
    return len(set_a.intersection(set_b)) / len(set_a.union(set_b))

def example_selector(examples: List[str], input: str, examples2use: int) -> List[str]:
    """Pseudo code for an example selector.

    examples: Training corpus to draw examples from
    input: Target sequence to translate
    examples2use: Number of examples to use

    Returns: List of selected examples
    """
    scores = [get_similarity(example, input) for example in examples]
    # Rank indices by similarity score, highest first
    sorted_idx = [i for i, _ in sorted(enumerate(scores), key=lambda x: x[1], reverse=True)]
    return [examples[i] for i in sorted_idx[:examples2use]]
ExampleSelectors are a type of prompt manipulator that allows us to dynamically change which examples are used during inference. There are many heuristics that can be used. Above I wrote some pseudo code showing how a selector from LangChain essentially works, using Jaccard similarity between the input sequence and the example sequences. In LangChain there are many more options, so check them out here.
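Here is a self-contained sketch tying the selection heuristic to prompt construction. The corpus layout (dicts with "input" and "output" keys) and the "Input/Output" template are illustrative assumptions, not a LangChain API.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity (IOU) over whitespace-separated tokens."""
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)

def select_and_prompt(corpus, query, k=2):
    """Pick the k most similar examples and load them into the prompt."""
    ranked = sorted(corpus, key=lambda ex: jaccard(ex["input"], query), reverse=True)
    parts = [f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in ranked[:k]]
    # The unanswered query goes last for the model to complete
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)
```

Each incoming query thus gets its own tailored prompt drawn from the full corpus, rather than a fixed handful of examples.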
There are two primary benefits to an approach like this. The first is that it makes your LLM data efficient, by selectively choosing the most relevant examples for the given input rather than statically loading the same few examples for all observations. The second benefit is cost savings, if tuning through a managed service. As of writing, using a fine-tuned base Davinci model costs $0.12 per 1,000 tokens, while instruct Davinci costs $0.02, meaning the fine-tuned model is six times as expensive, a 500% increase in price! These prices also don’t include the cost of training.
It’s important to note that these prices are subject to change, as OpenAI is not yet using LoRA or Adapters, as revealed in a now-deleted blog post. Nevertheless, the fine-tuned models are still likely to be more expensive due to the necessity of maintaining custom weights for individual users. This also doesn’t account for the cost of the examples included in the context. Your team will need to evaluate whether ICL or fine-tuning makes more sense for your task from cost and accuracy standpoints.
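A quick back-of-the-envelope comparison, using the per-token prices quoted above. The 500 prompt tokens and 1,500 example tokens are assumed numbers for illustration; in practice the break-even point depends on how many example tokens your task needs.

```python
def cost_per_request(prompt_tokens, price_per_1k, example_tokens=0):
    """Cost of one request: all billed prompt tokens times the per-1K rate."""
    return (prompt_tokens + example_tokens) / 1000 * price_per_1k

# Prices quoted above (USD per 1,000 tokens)
FINE_TUNED_DAVINCI = 0.12
INSTRUCT_DAVINCI = 0.02

# A fine-tuned model needs no in-context examples...
ft = cost_per_request(500, FINE_TUNED_DAVINCI)       # 0.06
# ...while ICL pays for, say, 1,500 extra example tokens per call
icl = cost_per_request(500, INSTRUCT_DAVINCI, example_tokens=1500)  # 0.04
```

Even carrying three times its own prompt length in examples, the instruct model comes out cheaper here, before accounting for training costs on the fine-tuned side.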
Takeaway: In-Context Learning with dynamic example loading may achieve the same results as fine-tuning without the substantial additional costs that would come from a managed service.
Let’s say you’re trying to answer complex questions over long documents. This task fundamentally requires the language model to have a good mastery of language and understanding. This leads us to a question: What if we assist the language model in breaking down the reasoning process into subtasks, similar to how a human would analyze a document and sequentially execute tasks?
This is exactly what researchers from Microsoft set out to accomplish, and their answer to this problem is PEARL. PEARL stands for Planning and Executing Actions for Reasoning over Long documents. The general framework is broken down into three steps:
- Action Mining: The language model is first prompted to read the documents and extract possible actions that could be used to answer domain-specific questions. To extract these actions, the language model is given a few example actions. I included an example of what an action could look like below.
- Plan Generation: After generating a set of task-specific actions, the LLM is now asked to generate a subsequent list of actions to execute in order given a question and context. The LLM is provided some examples of plans for other tasks which aids in construction of a quality plan. More details about the technicalities can be found in the paper.
- Plan Execution: The model now has the plan. We now provide the inputs to the model and execute the plan.
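The three stages above can be sketched as follows. This is not the authors’ implementation, just a schematic: `llm` is a hypothetical callable that sends a prompt string to a language model and returns its text output, and the prompts are illustrative.

```python
def pearl_sketch(llm, document, question, seed_actions, example_plans):
    """Schematic of PEARL's three stages (hypothetical llm(prompt) helper)."""
    # 1. Action Mining: elicit domain-specific actions from the document
    actions = llm(f"Given these example actions:\n{seed_actions}\n"
                  f"List actions useful for answering questions about:\n{document}")
    # 2. Plan Generation: order the mined actions into a plan for this question
    plan = llm(f"Example plans:\n{example_plans}\n"
               f"Using actions:\n{actions}\nWrite a plan to answer: {question}")
    # 3. Plan Execution: run each step, carrying forward intermediate output
    result = document
    for step in plan.splitlines():
        result = llm(f"Execute step '{step}' on:\n{result}")
    return result
```

The real framework adds quality checks between these stages, but the core loop is simply plan first, then execute step by step.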
There are some intermediary steps that are used to ensure quality between stages. The authors include a self-correction step which ensures the plan conforms to the required format. There is also a self-refinement step that determines if the plan can be used later as a few-shot example.
In evaluation, PEARL demonstrated notable improvements over other GPT models, especially on long documents. The key takeaway from this process is that in certain cases having multiple steps can significantly assist the model.
Another scenario where intermediate steps prove beneficial is when the number of documents to be included in your context exceeds what the language model supports. As it currently stands, the attention mechanism used by OpenAI scales at O(n²), and there is no solution to overcome this yet. This creates considerable interest in reducing the context to the most minimal form possible.
Depending on your task, there are ways to handle this. For instance, if your task revolves entirely around entities, there is an opportunity to extract the relevant entities and their related properties. You can think of this approach as a lossy compression that allows you to feed more context into the LLM. Another benefit of this intermediate step is that you have converted unstructured data to a structured format, which allows you to make informed decisions without the LLM. An example of this task is shown below in the figure from Fei et al.
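A minimal sketch of this entity-compression step, under some loud assumptions: `llm` is a hypothetical callable returning the model’s text output, the extraction prompt is illustrative, and a real pipeline would need to handle malformed JSON from the model.

```python
import json

def compress_to_entities(llm, document):
    """Lossy compression: keep only entities and their properties.

    llm is a hypothetical callable that returns a JSON string, e.g.
    '[{"entity": "Acme Corp", "type": "ORG", "sentiment": "negative"}]'.
    """
    raw = llm(
        "Extract every entity from the text below as a JSON list of "
        'objects with keys "entity", "type", and "sentiment".\n\n' + document
    )
    return json.loads(raw)

def decide_without_llm(entities):
    """Structured output enables rule-based decisions with no further LLM calls."""
    return [e["entity"] for e in entities if e.get("sentiment") == "negative"]
```

Once the document is reduced to structured records, downstream filtering, joining, and routing can happen in ordinary code, and only the compressed form needs to fit in the context window.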
Takeaway: Breaking a task into smaller subsequent problems can help simplify a larger problem into more manageable pieces. You can also use these smaller tasks to solve bottlenecks related to model limitations.
These are some general ideas regarding what researchers are exploring in the new frontiers of LLM performance and efficiency. This is not an exhaustive list of all things to be considered when fine-tuning a model, but it’s a good place to start when considering the journey.
For further reading, this post from Hugging Face regarding training LLMs is quite interesting, and would be a great place to start when exploring imitation models on a local problem. Getting a concrete understanding of LangChain is also supremely helpful. While most of the library could be rewritten for your use case, the main benefit is that it’s easier to keep up with research if other people are writing the code for you!
Here are the takeaways again:
- Depending on the complexity of your task, attempting to imitate the outputs of GPT or any sophisticated model with a weaker model may result in poor model performance.
- In-Context Learning with dynamic example loading may achieve the same results as fine-tuning without substantial additional costs that would come from a managed service.
- Breaking a task into smaller subsequent problems can help simplify a larger problem into more manageable pieces. You can also use these smaller tasks to solve bottlenecks related to model limitations.