Ask GPT-4 to prove, while rhyming, that there are infinitely many prime numbers, and it delivers. But ask it how your company did last quarter, and it will fail miserably. This illustrates a fundamental challenge of large language models (“LLMs”): they have a good grasp of general, public knowledge, but are entirely unaware of proprietary, non-public information. And proprietary information is critical to the vast majority of enterprise workflows. A model that understands the public internet is cute, but of little use in its raw form to most organizations.
Over the past year, I’ve had the privilege of working with a number of organizations applying LLMs to enterprise use cases. This post details key concepts and concerns that anyone embarking on such a journey should know, as well as a few hot takes on how I think LLMs will evolve and what that implies for ML product strategy. It’s intended for product managers, designers, engineers and other readers who have limited or no knowledge of how LLMs work “under the hood” but want to learn the concepts without going into technical details.
Prompt Engineering, Context Windows, and Embeddings
The simplest way to make an LLM reason about proprietary data is to provide the proprietary data in the model’s prompt. Most LLMs would have no problem answering the following correctly: “We have 2 customers, A and B, who spent $100K and $200K, respectively. Who was our largest customer and how much did they spend?” We’ve just done some basic prompt engineering, by prepending our query (the second sentence) with context (the first sentence).
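As a minimal sketch of that idea, the snippet below assembles a prompt by placing the context ahead of the question. The function name `build_prompt` and the "Context / Question / Answer" layout are illustrative assumptions, not a format any particular model requires.

```python
def build_prompt(context: str, question: str) -> str:
    """Prepend proprietary context to the user's question."""
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


context = "We have 2 customers, A and B, who spent $100K and $200K, respectively."
question = "Who was our largest customer and how much did they spend?"

# The resulting string is what would be sent to the LLM.
prompt = build_prompt(context, question)
print(prompt)
```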
Embeddings get the information necessary to answer the question into the context. An embedding transforms text into a numerical vector such that similar texts produce similar vectors (vectors that are “close together” in N-dimensional space). We might embed website text, documents, maybe even an entire corpus from SharePoint, Google Docs, or Notion. Then, for each user prompt, we embed the prompt and run a similarity search between the prompt vector and the vectorized text corpus. For example, if we embedded Wikipedia pages on animals, and the user asked a question about safaris, our similarity search would rank highly the Wikipedia articles about lions, zebras, and giraffes. This allows us to identify the text chunks most similar to the prompt, and thus most likely to answer it. We include these most similar text chunks in the context that is prepended to the prompt, so that the prompt hopefully contains all the information necessary for the LLM to answer the question.
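The retrieval step can be sketched in a few lines. Note the big assumption: `embed` below is a toy word-count stand-in, not a real embedding model, so it only captures shared words. A real embedding model would also capture meaning (for instance, linking "safari" to "lions"). The helper names and sample chunks are hypothetical.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    "lions hunt zebra on the african savanna",
    "giraffes graze on tall trees on the savanna",
    "penguins live on antarctic ice",
]
# The selected chunks would be prepended to the prompt as context.
context = top_k("which animals live on the savanna", chunks, k=2)
print(context)
```

In practice the corpus vectors would be computed once and stored, with only the prompt embedded at query time.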
A downside of embeddings is that every call to the LLM requires all the context to be passed with the prompt. The LLM has no “memory” of even the most basic enterprise-specific concepts. And since most cloud-based LLM providers charge per prompt token, this can get expensive fast.
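To get a feel for how the context inflates the bill, here is a back-of-the-envelope calculation. Every number in it (token counts, call volume, and the price per 1,000 prompt tokens) is a made-up assumption for illustration, not any provider's actual pricing.

```python
def monthly_prompt_cost(
    context_tokens: int,
    question_tokens: int,
    calls_per_month: int,
    usd_per_1k_prompt_tokens: float,
) -> float:
    """Prompt-side cost only; completion tokens are ignored in this sketch."""
    tokens_per_call = context_tokens + question_tokens
    return tokens_per_call * calls_per_month * usd_per_1k_prompt_tokens / 1000


# Hypothetical workload: 100,000 calls a month at $0.03 per 1K prompt tokens.
with_context = monthly_prompt_cost(3000, 50, 100_000, 0.03)
without_context = monthly_prompt_cost(0, 50, 100_000, 0.03)

print(f"with context:    ${with_context:,.0f} per month")
print(f"without context: ${without_context:,.0f} per month")
```

Under these assumptions the retrieved context, not the question, accounts for nearly all of the prompt cost.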
Fine-tuning allows an LLM to understand enterprise-specific concepts without including them in each prompt. We take a foundation model, which already encodes general knowledge across billions of learned parameters, and tweak those parameters to reflect specific enterprise knowledge, while still retaining the underlying general knowledge. When we generate inferences with the new fine-tuned model, we get that enterprise knowledge “for free”.
In contrast with embeddings/prompt engineering, where the underlying model is a third-party black box, fine-tuning is closer to classical machine learning, where ML teams created their own models from scratch. Fine-tuning requires a training dataset with labeled observations; the fine-tuned model is highly sensitive to the quality and volume of that training data. We also need to make configuration decisions (number of epochs, learning rate, etc.), orchestrate long-running training jobs, and track model versions. Some foundation model providers offer APIs that abstract away some of this complexity; others do not.
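For a sense of what "labeled observations" look like, the sketch below builds a tiny dataset of prompt/response pairs, holds some out for validation, and serializes the rest as JSON Lines. The field names `prompt` and `completion`, the example records, and the 25% split are all assumptions for illustration; each provider or training library defines its own required format.

```python
import json
import random
from typing import Dict, List, Tuple

# Made-up labeled examples: each pairs a prompt with the desired "expert" response.
examples: List[Dict[str, str]] = [
    {"prompt": "What does ARR stand for in our reports?", "completion": "Annual recurring revenue."},
    {"prompt": "Which team owns the billing service?", "completion": "The payments team."},
    {"prompt": "What is our fiscal year end?", "completion": "January 31."},
    {"prompt": "What is the weather today?", "completion": "I don't know."},
]


def split(
    data: List[Dict[str, str]], validation_fraction: float = 0.25, seed: int = 0
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Shuffle deterministically, then hold out a fraction for validation."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


train, validation = split(examples)

# One JSON object per line is a common layout for training files.
train_jsonl = "\n".join(json.dumps(record) for record in train)
print(train_jsonl)
```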
While inferences may be cheaper with fine-tuned models, that can be outweighed by costly training jobs. And some foundation model providers (like OpenAI) only support fine-tuning of lagging-edge models (so not ChatGPT or GPT-4).
One of the novel, significant challenges presented by LLMs is measuring the quality of complex outputs. Classical ML teams have tried-and-true methods for measuring the accuracy of simple outputs, like numerical predictions or categorizations. But most enterprise use cases for LLMs involve generating responses that are tens to thousands of words. Concepts sophisticated enough to require more than ten words can normally be worded in many ways. So even if we have a human-validated “expert” response, doing an exact string match of a model response to the expert response is too stringent a test, and would underestimate model response quality.
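A small example makes the problem concrete. Below, an exact string match rejects a correct answer that is merely worded differently, while a looser token-overlap score (F1 over shared words) gives it substantial credit. Token-overlap F1 is a technique I am swapping in for illustration, not something described above, and it has its own blind spots (it ignores word order and synonyms).

```python
from collections import Counter


def exact_match(model_answer: str, expert_answer: str) -> bool:
    """Strictest possible test: identical strings, ignoring case and outer whitespace."""
    return model_answer.strip().lower() == expert_answer.strip().lower()


def token_f1(model_answer: str, expert_answer: str) -> float:
    """Harmonic mean of precision and recall over shared words."""
    model_tokens = model_answer.lower().split()
    expert_tokens = expert_answer.lower().split()
    overlap = sum((Counter(model_tokens) & Counter(expert_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(model_tokens)
    recall = overlap / len(expert_tokens)
    return 2 * precision * recall / (precision + recall)


expert = "Customer B was the largest and spent $200K"
model = "B was our largest customer and spent $200K"

print(exact_match(model, expert))  # False, although the answer is correct
print(token_f1(model, expert))     # 0.875
```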
The Evals framework, open-sourced by OpenAI, is one approach to tackling this problem. This framework requires a labeled test set (where prompts are matched to “expert” responses), but it allows broad types of comparison between model and expert responses. For example, is the model-generated answer: a subset or superset of the expert answer; factually equivalent to the expert answer; more or less concise than the expert answer? The caveat is that Evals perform these checks using an LLM. If there’s a flaw in the “checker” LLM, the eval results may themselves be inaccurate.
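The general pattern of using one LLM to grade another can be sketched as follows. This is not the Evals API: the function names, the verdict labels, and the grading prompt are hypothetical, and `checker` is any callable that sends a prompt to an LLM and returns its reply. A stub checker stands in for the real model here.

```python
from typing import Callable

# Hypothetical verdict labels, loosely mirroring the comparisons described above.
VERDICTS = {
    "A": "subset of the expert answer",
    "B": "superset of the expert answer",
    "C": "factually equivalent",
    "D": "disagrees with the expert answer",
}


def grade(
    question: str,
    expert_answer: str,
    model_answer: str,
    checker: Callable[[str], str],
) -> str:
    """Ask a checker LLM to compare two answers and map its reply to a verdict."""
    options = "\n".join(f"({letter}) {label}" for letter, label in VERDICTS.items())
    grading_prompt = (
        f"Question: {question}\n"
        f"Expert answer: {expert_answer}\n"
        f"Model answer: {model_answer}\n"
        "How does the model answer compare to the expert answer?\n"
        f"{options}\n"
        "Reply with a single letter."
    )
    reply = checker(grading_prompt).strip().upper()
    # Anything that is not a known letter is flagged rather than guessed at.
    return VERDICTS.get(reply[:1], "unparseable")


def stub_checker(prompt: str) -> str:
    """Stand-in for a real LLM call; always answers 'C'."""
    return "C"


verdict = grade(
    "Who was our largest customer?",
    "Customer B, who spent $200K.",
    "B was the largest, at $200K.",
    stub_checker,
)
print(verdict)
```

The "unparseable" branch is one place where a flawed checker shows up: if it appears often, the eval results should not be trusted.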
If you’re using an LLM in production, you need to have confidence that it will handle misguided or malicious user inputs safely. For most enterprises, the starting point is ensuring the model doesn’t spread fake news. That means a system that knows its own limitations and when to say “I don’t know.” There are many tactical approaches here. It can be done via prompt engineering, with prompt language like “Respond ‘I don’t know’ if the question cannot be answered with the context provided above”. It can also be done with fine-tuning, by providing out-of-scope training examples where the expert response is “I don’t know”.
Enterprises also need to guard against malicious user inputs, e.g. prompt hacking. Limiting the format and length of the system’s acceptable inputs and outputs can be an easy and effective start. Precautions are a good idea if you’re only serving internal users and they’re essential if you’re serving external users.
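A first line of defense along these lines might look like the sketch below. The length limits and the allowed-character pattern are arbitrary assumptions; the right values depend on the use case, and checks like these reduce the attack surface rather than eliminate prompt hacking.

```python
import re

MAX_INPUT_CHARS = 500    # assumed limit, tune per use case
MAX_OUTPUT_CHARS = 2000  # assumed limit, tune per use case

# Letters, digits, whitespace, and a small set of punctuation only.
ALLOWED_INPUT = re.compile(r"[\w\s.,?!'$%-]+")


def validate_input(text: str) -> str:
    """Reject inputs that are empty, too long, or contain unexpected characters."""
    text = text.strip()
    if not text:
        raise ValueError("empty input")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    if not ALLOWED_INPUT.fullmatch(text):
        raise ValueError("input contains unsupported characters")
    return text


def clamp_output(text: str) -> str:
    """Cap the length of whatever the model returns before showing it to the user."""
    return text[:MAX_OUTPUT_CHARS]


print(validate_input("  Who was our largest customer?  "))
```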
The developers of the most popular LLMs (OpenAI / GPT-4, Google / Bard) have taken pains to align their models with human preferences and deploy sophisticated moderation layers. If you ask GPT-4 or Bard to tell you a racist or misogynistic joke, they will politely refuse.
That’s good news. The bad news is that this moderation, which targets societal biases, doesn’t necessarily prevent institutional biases. Imagine our customer support team has a history of being rude to a particular type of customer. If historical customer support conversations are naively used to construct a new AI system (for example, via fine-tuning) that system is likely to replicate that bias.
If you’re using past data to train an AI model (be it a classical model or a generative model), closely scrutinize which past situations you want to perpetuate into the future and which you do not. Sometimes it’s easier to set principles and work from those (for example, via prompt engineering), without using past data directly.
Unless you’ve been living under a rock, you know generative AI models are advancing incredibly rapidly. Given an enterprise use case, the best LLM for it today may not be the best solution in six months and almost certainly will not be the best solution in six years. Smart ML teams know they will need to switch models at some point.
But there are two other major reasons to build for easy LLM “swapping”. First, many foundation model providers have struggled to support exponentially-growing user volume, leading to outages and degraded service. Building a fallback foundation model into your system is a good idea. Second, it can be quite useful to test multiple foundation models in your system (“a horse race”) to get a sense of which performs best. Per the section above on Evals, it’s often difficult to measure model quality analytically, so sometimes you just want to run two models and qualitatively compare the responses.
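One way to keep models swappable is to hide each provider behind the same simple interface, here just a function from prompt to response, and try them in order. The sketch below uses stub functions in place of real provider clients; `complete_with_fallback`, `primary`, and `backup` are hypothetical names.

```python
from typing import Callable, List, Sequence

# Any model, from any provider, is wrapped as "prompt in, text out".
Model = Callable[[str], str]


def complete_with_fallback(prompt: str, models: Sequence[Model]) -> str:
    """Try each model in order and return the first successful response."""
    errors: List[Exception] = []
    for model in models:
        try:
            return model(prompt)
        except Exception as exc:  # outage, timeout, rate limit, ...
            errors.append(exc)
    raise RuntimeError(f"all {len(models)} models failed: {errors}")


def primary(prompt: str) -> str:
    """Stub simulating a provider outage."""
    raise TimeoutError("provider outage")


def backup(prompt: str) -> str:
    """Stub simulating a healthy fallback model."""
    return f"[backup] answer to: {prompt}"


answer = complete_with_fallback("Summarize Q3 results", [primary, backup])
print(answer)
```

The same interface makes a "horse race" straightforward: call every model in the list with the same prompt and compare the responses side by side.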
Read the terms and conditions of any foundation model you’re considering using. If the model provider has the right to use user inputs for future model training, that’s worrisome. LLMs are so large it’s possible that specific user queries/responses become directly encoded in a future model version, and could then become accessible to any user of that version. Imagine a user at your organization queries “how can I clean up this code that does XYZ? [your proprietary, confidential code here]” If this query is then used by the model provider to retrain their LLM, that new version of the LLM may learn that your proprietary code is a great way to solve use case XYZ. If a competitor asks how to do XYZ, the LLM could “leak” your source code, or something very similar.
OpenAI now allows users to opt out of having their data used to train models, which is a good precedent, but not every model provider has followed its example. Some organizations are also exploring running LLMs within their own virtual private clouds; this is a key reason for much of the interest in open-source LLMs.
Prompt Engineering Will Dominate Fine Tuning
When I first started adapting LLMs for enterprise use, I was much more interested in fine-tuning than prompt engineering. Fine-tuning felt like it adhered to the principles of classical ML systems to which I was accustomed: wrangle some data, produce a train/test dataset, kick off a training job, wait a while, evaluate the results against some metric.
But I’ve come to believe that prompt engineering (with embeddings) is a better approach for most enterprise use cases. First, the iteration cycle for prompt engineering is far faster than for fine-tuning, because there is no model training, which can take hours or days. Changing a prompt and generating new responses can be done in minutes. Conversely, fine-tuning is an irreversible process in terms of model training; if you used incorrect training data or a better base model comes out, you need to restart your fine-tuning jobs. Second, prompt engineering requires far less knowledge of ML concepts like neural network hyperparameter optimization, training job orchestration or data wrangling. Fine-tuning often requires experienced ML engineers, while prompt engineering can often be done by software engineers without ML experience. Third, prompt engineering works better for the fast-growing strategy of model chaining, in which complex requests are decomposed into smaller, constituent requests, each of which can be assigned to a different LLM. Sometimes the best “constituent model” is a fine-tuned model. But most of the value-add work for enterprises is (i) figuring out how to break apart their problem, (ii) writing the prompts for each constituent part, and (iii) identifying the best off-the-shelf model for each part; it’s not in creating their own fine-tuned models.
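Model chaining can be sketched as a list of steps, each pairing a prompt template with whichever model suits that step, where the output of one step feeds the next. The `run_chain` helper, the templates, and the stub models below are hypothetical; real chains often branch or run steps in parallel rather than strictly in sequence.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]
# Each step: (prompt template containing "{input}", model to run it on).
Step = Tuple[str, Model]


def run_chain(user_request: str, steps: List[Step]) -> str:
    """Feed the output of each step into the next step's prompt template."""
    current = user_request
    for template, model in steps:
        current = model(template.format(input=current))
    return current


def planner(prompt: str) -> str:
    """Stub for a model that is good at decomposing requests."""
    return f"plan[{prompt}]"


def writer(prompt: str) -> str:
    """Stub for a model that is good at drafting prose."""
    return f"draft[{prompt}]"


result = run_chain(
    "Q3 churn",
    [("Plan: {input}", planner), ("Write: {input}", writer)],
)
print(result)  # draft[Write: plan[Plan: Q3 churn]]
```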
The advantages of prompt engineering are likely to widen over time. Today, prompt engineering requires long, expensive prompts (since context must be included in each prompt). But I’d bet on rapidly declining cost per token, as the model provider space gets more competitive and providers figure out how to train LLMs more cheaply. Prompt engineering is also limited today by maximum prompt sizes. Today, OpenAI accepts up to 32K tokens (~40 pages of average English text) per prompt for GPT-4. Not bad! And I’d bet on larger context windows coming out in the near future.
Data Won’t Be The Moat It Once Was
As LLMs have become better at producing human-interpretable reasoning, it’s useful to consider how humans use data to reason, and what that implies for LLMs. Humans don’t actually use much data! Most of the time, we do “zero-shot learning”, which simply means we answer questions without the question being accompanied by a set of example question-answer pairs. The questioner just provides the question, and we answer based on logic, principles, heuristics, biases, etc.
This is different from the LLMs of just a few years ago, which were only good at few-shot learning, where you needed to include a handful of example question-answer pairs in your prompt. And it’s very different from classical ML, where the model needed to be trained on hundreds, thousands, or millions of question-answer pairs.
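The difference between the two prompting styles is easy to see side by side. In this sketch, the "Q:/A:" layout, the instruction text, and the example pairs are illustrative assumptions rather than a required format.

```python
from typing import List, Tuple


def zero_shot_prompt(instructions: str, question: str) -> str:
    """Instructions plus the question, with no worked examples."""
    return f"{instructions}\n\nQ: {question}\nA:"


def few_shot_prompt(
    instructions: str, examples: List[Tuple[str, str]], question: str
) -> str:
    """Same as zero-shot, but with example question-answer pairs included."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{instructions}\n\n{shots}\n\nQ: {question}\nA:"


instructions = "Classify the sentiment of each customer message as positive or negative."
examples = [
    ("The onboarding was painless.", "positive"),
    ("Support never replied to my ticket.", "negative"),
]
question = "The new dashboard is fantastic."

zero_shot = zero_shot_prompt(instructions, question)
few_shot = few_shot_prompt(instructions, examples, question)
print(few_shot)
```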
I strongly believe that an increasing, dominant share of LLM use cases will be “zero-shot”. LLMs will be able to answer most questions without any user-provided examples. They will need prompt engineering, in the form of instructions, policies, assumptions, etc. For example, this post uses GPT-4 to review code for security vulnerabilities; the approach requires no data on past instances of vulnerable code. Having clear instructions, policies, and assumptions will become increasingly important — but having large volumes of high-quality, labeled, proprietary data will become less important.