Even though the prompt for these two examples was structured in the same way, the responses differed in a few key ways. Response 1 attempts to create a `DatasetView` by adding `ViewStages` to the dataset. Response 2 defines and applies a MongoDB aggregation pipeline, followed by the `limit()` method (applying a `Limit` stage) to limit the view to 10 samples, as well as a non-existent (AKA hallucinated) `display()` method. Additionally, while Response 1 loads in an actual dataset (Open Images V6), Response 2 is effectively template code, as `"your_model_name"` needs to be filled in.
These examples also highlighted the following issues:
- Boilerplate code: some responses contained code for importing modules, instantiating datasets (and models), and visualizing the view (`session = fo.launch_app(dataset)`).
- Explanatory text: in many cases — including educational contexts — the fact that the model explains its “reasoning” is a positive. If we want to perform queries on the user’s behalf, however, this explanatory text just gets in the way. Some queries even resulted in multiple code blocks, split up by text.
What we really wanted was for the LLM to respond with code that could be copied and pasted into a Python process, without all of the extra baggage. As a first attempt at prompting the model, I started to give the following text as prefix to any natural language query I wanted it to translate:
Your task is to convert input natural language queries into Python code to generate ViewStages for the computer vision library FiftyOne.
Here are some rules:
- Avoid all header code like importing packages, and all footer code like saving the dataset or launching the FiftyOne App.
- Just give me the final Python code, no intermediate code snippets or explanation.
- always assume the dataset is stored in the Python variable `dataset`
- you can use the following ViewStages to generate your response, in any combination: exclude, exclude_by, exclude_fields, exclude_frames, …
Crucially, I defined a task, and set rules, instructing the model what it was allowed and not allowed to do.
Note: with responses coming in a more uniform format, it was at this point that I moved from the ChatGPT chat interface to using GPT-4 via OpenAI’s API.
Our team also decided that, at least to start, we would limit the scope of what we were asking the LLM to do. While the FiftyOne query language itself is full-bodied, asking a pre-trained model to do arbitrarily complex tasks without any fine-tuning is a recipe for disappointment. Start simple, and iteratively add complexity.
For this experiment, we imposed the following bounds:
- Just images and videos: don’t expect the LLM to query 3D point clouds or grouped datasets.
- Ignore fickle ViewStages: most `ViewStages` abide by the same basic rules, but a few buck the trend. `Concat` is the only `ViewStage` that takes in a second `DatasetView`; `Mongo` uses MongoDB aggregation syntax; `GeoNear` has a `query` argument; and `GeoWithin` requires a 2D array to define the region to which the "within" applies. We decided to ignore `Mongo` and `GeoWithin`, and to support all `GeoNear` usage except for the `query` argument.
- Stick to two stages: while it would be great for the model to compose an arbitrary number of stages, in most workflows I’ve seen, one or two `ViewStages` suffice to create the desired `DatasetView`. The goal of this project was not to get caught in the weeds, but to build something useful for computer vision practitioners.
In addition to giving the model an explicit “task” and providing clear instructions, we found that we could improve performance by giving the model more information about how FiftyOne’s query language works. Without this information, the LLM is flying blind. It is just grasping, reaching out into the darkness.
For example, in Prompt 2, when I asked for false positive predictions, the response attempted to reference these false positives with `predictions.mistakes.false_positive`. As far as ChatGPT was concerned, this seemed like a reasonable way to store and access information about false positives.
The model didn’t know that in FiftyOne, the truth/falsity of detection predictions is evaluated with `dataset.evaluate_detections()`, and that after running said evaluation, you can retrieve all images with a false positive by matching:
images_with_fp = dataset.match(F("eval_fp") > 0)
I tried to clarify the task by providing additional rules, such as:
- When a user asks for the most "unique" images, they are referring to the "uniqueness" field stored on samples.
- When a user asks for the most "wrong" or "mistaken" images, they are referring to the "mistakenness" field stored on samples.
- If a user doesn't specify a label field, e.g. "predictions" or "ground_truth" to which to apply certain operations, assume they mean "ground_truth" if a ground_truth field exists on the data.
I also provided information about label types:
- Object detection bounding boxes are in [top-left-x, top-left-y, width, height] format, all relative to the image width and height, in the range [0, 1]
- possible label types include Classification, Classifications, Detection, Detections, Segmentation, Keypoint, Regression, and Polylines
Additionally, while providing the model with a list of allowed view stages nudged it towards using them, the model still didn’t know:
- When a given stage was relevant, or
- How to use the stage in a syntactically correct manner
To fill this gap, I wanted to give the LLM information about each of the view stages. I wrote code to loop through the view stages (which you can list with `fiftyone.list_view_stages()`), store the docstring, and then split the text of the docstring into description and inputs/arguments.
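The docstring-splitting step can be sketched as follows. This is a minimal illustration assuming Google-style docstrings with an `Args:` section; the toy docstring below is illustrative, not FiftyOne's actual text, and in the real code the loop would run over `fiftyone.list_view_stages()`.

```python
def split_docstring(docstring):
    """Split a Google-style docstring into (description, args) parts."""
    description, args = [], []
    in_args = False
    for line in docstring.strip().splitlines():
        if line.strip() == "Args:":
            in_args = True
            continue
        (args if in_args else description).append(line)
    return "\n".join(description).strip(), "\n".join(args).strip()

# Toy docstring standing in for a real ViewStage docstring
doc = """Sorts the samples in the collection by the given field.

Args:
    field: the field to sort by
    reverse (False): whether to sort in descending order
"""

desc, args = split_docstring(doc)
print(desc)  # the description block
print(args)  # the inputs/arguments block
```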
However, I soon ran into a problem: context length.
Using the base GPT-4 model via the OpenAI API, I was already bumping up against the 8,192 token context length. And this was before adding in examples, or any information about the dataset itself!
OpenAI does have a GPT-4 model with a 32,768 token context which in theory I could have used, but a back-of-the-envelope calculation convinced me that this could get expensive. If we filled the entire 32k token context, given OpenAI’s pricing, it would cost about $2 per query!
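That back-of-the-envelope calculation is easy to reproduce. The price below is an assumption based on GPT-4-32k's input pricing at the time (roughly $0.06 per 1K prompt tokens):

```python
# Cost of filling the full 32k context with prompt tokens, assuming
# ~$0.06 per 1K input tokens (GPT-4-32k input pricing at the time)
context_tokens = 32_768
price_per_1k_tokens = 0.06  # USD, assumed
cost = context_tokens / 1000 * price_per_1k_tokens
print(f"${cost:.2f} per query")  # → $1.97 per query
```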
Instead, our team rethought our approach and did the following:
- Switch to GPT-3.5
- Minimize token count
- Be more selective with input info
Switching to GPT-3.5
There’s no such thing as a free lunch — this did lead to slightly lower performance, at least initially. Over the course of the project, we were able to recover and far surpass this through prompt engineering! In our case, the effort was worth the cost savings. In other cases, it might not be.
Minimizing Token Count
With context length becoming a limiting factor, I employed the following simple trick: use ChatGPT to optimize prompts!
One `ViewStage` at a time, I took the original description and list of inputs, and fed this information into ChatGPT, along with a prompt asking the LLM to minimize the token count of that text while retaining all semantic information. Using tiktoken to count the tokens in the original and compressed versions, I was able to reduce the number of tokens by about 30%.
Being More Selective
While it’s great to provide the model with context, some information is more helpful than other information, depending on the task at hand. If the model only needs to generate a Python query involving two `ViewStages`, it probably won’t benefit terribly from information about what inputs the other `ViewStages` take.
We knew that we needed a way to select relevant information depending on the input natural language query. However, it wouldn’t be as simple as performing a similarity search on the descriptions and input parameters, because the former often comes in very different language than the latter. We needed a way to link input and information selection.
That link, as it turns out, was examples.
If you’ve ever played around with ChatGPT or another LLM, you’ve probably experienced first-hand how providing the model with even just a single relevant example can drastically improve performance.
As a starting point, I came up with 10 completely synthetic examples and passed these along to GPT-3.5 by adding this below the task rules and
ViewStage descriptions in my input prompt:
Here are a few examples of Input-Output Pairs in A, B form:
A) "Filepath starts with '/Users'"
B) `dataset.match(F("filepath").starts_with("/Users"))`
A) "Predictions with confidence > 0.95"
B) `dataset.filter_labels("predictions", F("confidence") > 0.95)`
With just these 10 examples, there was a noticeable improvement in the quality of the model’s responses, so our team decided to be systematic about it.
- First, we combed through our docs, finding any and all examples of views created through combinations of `ViewStages`.
- We then went through the list of `ViewStages` and added examples so that we had as close to complete coverage as possible over usage syntax. To this end, we made sure that there was at least one example for each argument or keyword, to give the model a pattern to follow.
- With usage syntax covered, we varied the names of fields and classes in the examples so that the model wouldn’t generate any false assumptions about names correlating with stages. For instance, we don’t want the model to strongly associate the “person” class with the `match_labels()` method just because all of the examples for `match_labels()` happen to include a “person” class.
Selecting Similar Examples
At the end of this example generation process, we already had hundreds of examples — far more than could fit in the context length. Fortunately, these examples contained (as input) natural language queries that we could directly compare with the user’s input natural language query.
To perform this comparison, we pre-computed embeddings for these example queries with OpenAI’s text-embedding-ada-002 model. At run-time, the user’s query is embedded with the same model, and the examples with the most similar natural language queries (by cosine distance) are selected. Initially, we used ChromaDB to construct an in-memory vector database. However, given that we were dealing with hundreds or thousands of vectors, rather than hundreds of thousands or millions, it actually made more sense to switch to an exact vector search (which also limited our dependencies).
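With only hundreds of vectors, exact search is just a cosine-similarity scan over the precomputed embeddings. A dependency-free sketch, where toy 3-d vectors stand in for real text-embedding-ada-002 embeddings (which are 1536-dimensional):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k_examples(query_vec, example_vecs, k=5):
    """Return indices of the k example embeddings most similar to the query."""
    ranked = sorted(
        range(len(example_vecs)),
        key=lambda i: cosine_similarity(query_vec, example_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings; in the real pipeline these come from the embedding model
examples = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.9, 0.1, 0.0)]
print(top_k_examples((1.0, 0.05, 0.0), examples, k=2))  # → [0, 2]
```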
It was becoming difficult to manage these examples and the components of the prompt, so it was at this point that we started to use LangChain’s Prompts module. Initially, we were able to use their similarity `ExampleSelector` to select the most relevant examples, but eventually we had to write a custom `ExampleSelector` so that we had more control over the pre-filtering.
Filtering for Appropriate Examples
In the computer vision query language, the appropriate syntax for a query can depend on the media type of the samples in the dataset: videos, for example, sometimes need to be treated differently than images. Rather than confuse the model by giving seemingly conflicting examples, or complicating the task by forcing the model to infer based on media type, we decided to only give examples that would be syntactically correct for a given dataset. In the context of vector search, this is known as pre-filtering.
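The pre-filter itself can be as simple as keeping only the examples whose metadata matches the dataset at hand. A minimal sketch, where the `media_type` tag on each example is an illustrative schema rather than the repo's actual one:

```python
def prefilter_examples(examples, media_type):
    """Keep only examples that are syntactically valid for this media type.

    Examples tagged "any" apply to both images and videos.
    """
    return [
        ex for ex in examples
        if ex["media_type"] in (media_type, "any")
    ]

examples = [
    {"query": "images with a dog", "media_type": "image"},
    {"query": "clips where a person appears", "media_type": "video"},
    {"query": "first 10 samples", "media_type": "any"},
]
print(prefilter_examples(examples, "image"))
# keeps only the "image" and "any" examples
```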
This idea worked so well that we eventually applied the same considerations to other features of the dataset. In some cases, the differences were merely syntactic — when querying labels, the syntax for accessing a
Detections label is different from that of a
Classification label. Other filters were more strategic: sometimes we didn’t want the model to know about a certain feature of the query language.
For instance, we didn’t want to give the LLM examples utilizing computations it would not have access to. If a text similarity index had not been constructed for a specific dataset, it would not make sense to feed the model examples of searching for the best visual matches to a natural language query. In a similar vein, if the dataset did not have any evaluation runs, then querying for true positives and false positives would yield either errors or null results.
You can see the complete example pre-filtering pipeline in view_stage_example_selector.py in the GitHub repo.
Choosing Contextual Info Based on Examples
For a given natural language query, we then use the examples selected by our `ExampleSelector` to decide what additional information to provide in the context.
In particular, we count the occurrences of each `ViewStage` in these selected examples, identify the five most frequent `ViewStages`, and add the descriptions and information about the input parameters for these `ViewStages` as context in our prompt. The rationale for this is that if a stage frequently occurs in similar queries, it is likely (but not guaranteed) to be relevant to this query.
If it is not relevant, then the description will help the model to determine that it is not relevant. If it is relevant, then information about input parameters will help the model generate a syntactically correct `ViewStage`.
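The frequency count itself is straightforward with `collections.Counter`. A minimal sketch, assuming each selected example carries a list of the stages it uses (the `stages` key is illustrative, not the repo's actual schema):

```python
from collections import Counter

def top_stages(selected_examples, n=5):
    """Return the n most frequent ViewStages across the selected examples."""
    counts = Counter(
        stage for ex in selected_examples for stage in ex["stages"]
    )
    return [stage for stage, _ in counts.most_common(n)]

selected = [
    {"stages": ["match", "sort_by"]},
    {"stages": ["match", "limit"]},
    {"stages": ["filter_labels"]},
]
print(top_stages(selected, n=2))  # "match" appears most often
```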
Up until this point, we had focused on squeezing as much relevant information as possible — and just relevant information — into a single prompt. But this approach was reaching its limits.
Even without accounting for the fact that every dataset has its own names for fields and classes, the space of possible Python queries was just too large.
To make progress, we needed to break the problem down into smaller pieces. Taking inspiration from recent approaches, including Chain-of-thought prompting and Selection-inference prompting, we divided the problem of generating a `DatasetView` into four distinct selection subproblems:
- Algorithms
- Runs of algorithms
- Relevant fields
- Relevant class names
We then chained these selection “links” together, and passed their outputs along to the model in the final prompt for `ViewStage` inference.
For each of these subtasks, the same principles of uniformity and simplicity apply. We tried to recycle the natural language queries from existing examples wherever possible, but made a point to simplify the formats of all inputs and outputs for each selection task. What is simplest for one link may not be simplest for another!
In FiftyOne, information resulting from a computation on a dataset is stored as a “run”. This includes computations like
uniqueness, which measures how unique each image is relative to the rest of the images in the dataset, and
hardness, which quantifies the difficulty a model will experience when trying to learn on this sample. It also includes computations of
similarity, which involve generating a vector index for embeddings associated with each sample, and even
evaluation computations, which we touched upon earlier.
Each of these computations generates a different type of results object, which has its own API. Furthermore, there is not any one-to-one correspondence between
ViewStages and these computations. Let’s take uniqueness as an example.
A uniqueness computation result is stored in a float-valued field (`"uniqueness"` by default) on each image. This means that, depending on the situation, you may want to sort by uniqueness:
view = dataset.sort_by("uniqueness")
Retrieve samples with uniqueness above a certain threshold:
from fiftyone import ViewField as F
view = dataset.match(F("uniqueness") > 0.8)
Or even just show the uniqueness field:
view = dataset.select_fields("uniqueness")
In this selection step, we task the LLM with predicting which of the possible computations might be relevant to the user’s natural language query. An example for this task looks like:
Query: "most unique images with a false positive"
Algorithms used: ["uniqueness", "evaluation"]
Runs of Algorithms
Once potentially relevant computational algorithms have been identified, we task the LLM with selecting the most appropriate run of each computation. This is essential because some computations can be run multiple times on the same dataset with different configurations, and a
ViewStage may only make sense with the right “run”.
A great example of this is similarity runs. Suppose you are testing out two models (InceptionV3 and CLIP) on your data, and you have generated a vector similarity index on the dataset for each model. When using the
SortBySimilarity view stage, which images are determined to be most similar to which other images can depend quite strongly on the embedding model, so the following two queries would need to generate different results:
## query A:
"show me the 10 most similar images to image 1 with CLIP"
## query B:
"show me the 10 most similar images to image 1 with InceptionV3"
This run selection process is handled separately for each type of computation, as each requires a modified set of task rules and examples.
This link in the chain involves identifying all field names relevant to the natural language query that are not related to a computational run. For instance, not all datasets with predictions have those labels stored under the name `"predictions"`. Depending on the person, dataset, and application, predictions might be stored in a field named `"predictions_05_16_2023"`, or something else entirely.
Examples for this task included the query, the names and types of all fields in the dataset, and the names of relevant fields:
Query: "Exclude model2 predictions from all samples"
Available fields: "[id: string, filepath: string, tags: list, ground_truth: Detections, model1_predictions: Detections, model2_predictions: Detections, model3_predictions: Detections]"
Required fields: "[model2_predictions]"
Relevant Class Names
For label fields like classifications and detections, translating a natural language query into Python code requires using the names of actual classes in the dataset. To accomplish this, I tasked GPT-3.5 with performing named entity recognition for label classes in input queries.
In the query “samples with at least one cow prediction and no horses”, the model’s job is to identify `"cow"` and `"horse"`. These identified names are then compared against the class names for label fields selected in the prior step: first case sensitive, then case insensitive, then plurality insensitive.
If no matches are found between named entities and the class names in the dataset, we fall back to semantic matching: an entity like "table" might match the class "dining table", and other entities are matched against the dataset’s class list, e.g. ["cat", "dog", "horse", …].
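The exact-match cascade that runs before the semantic fallback might look like the following sketch; the crude `_singular()` helper, which just strips a trailing "s", is an illustrative stand-in for real plurality handling:

```python
def match_class(entity, class_names):
    """Match a named entity against dataset class names, loosening in stages."""
    # 1) exact, case-sensitive match
    if entity in class_names:
        return entity
    # 2) case-insensitive match
    lowered = {c.lower(): c for c in class_names}
    if entity.lower() in lowered:
        return lowered[entity.lower()]
    # 3) plurality-insensitive match (crude: strip a trailing "s")
    def _singular(s):
        return s[:-1] if s.endswith("s") else s
    singulars = {_singular(c.lower()): c for c in class_names}
    # None signals no exact match → fall back to semantic matching
    return singulars.get(_singular(entity.lower()))

classes = ["Cow", "horses", "dining table"]
print(match_class("cow", classes))    # → 'Cow'
print(match_class("horse", classes))  # → 'horses'
print(match_class("table", classes))  # → None (semantic fallback)
```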
Whenever the match is not identical, we use the names of the matched classes to update the query that is passed into the final inference step:
Input query: "20 random images with a table"
Updated query: "20 random images with a dining table"
Once all of these selections have been made, the similar examples, relevant descriptions, and relevant dataset info (selected algorithmic runs, fields, and classes) are passed in to the model, along with the (potentially modified) query.
Rather than instruct the model to return code in the form `dataset.view1().view2()…viewn()` as we were doing initially, we ended up nixing the `dataset` part, and instead asking the model to return the `ViewStages` as a list. At the time, I was surprised to see this improve performance, but in hindsight, it fits with the insight that the more you split the task up, the better an LLM can do.
Creating an LLM-powered toy is cool, but turning the same kernel into an LLM-powered application is much cooler. Here’s a brief overview of how we did it.
As we turned this from a proof-of-principle into a robustly engineered system, we used unit testing to stress test the pipeline and identify weak points. The modular nature of links in the chain means that each step can individually be unit tested, validated, and iterated on without needing to run the entire chain.
This leads to faster improvement, because different individuals or groups of people within a prompt-engineering team can work on different links in the chain in parallel. Additionally, it results in reduced costs, as in theory, you should only need to run a single step of LLM inference to optimize a single link in the chain.
Evaluating LLM-Generated Code
We used Python’s `eval()` function to turn GPT-3.5’s response into a `DatasetView`. We then set the state of the FiftyOne App `session` to display this view.
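Evaluating arbitrary model output with `eval()` is risky, so it helps to restrict the evaluation namespace to just the names the response is allowed to touch. A minimal sketch of the pattern, using a stand-in object in place of a real FiftyOne dataset:

```python
class FakeView:
    """Stand-in for a FiftyOne DatasetView, for illustration only."""
    def __init__(self, ops=()):
        self.ops = list(ops)
    def sort_by(self, field):
        return FakeView(self.ops + [("sort_by", field)])
    def limit(self, n):
        return FakeView(self.ops + [("limit", n)])

def build_view(dataset, llm_response):
    # Evaluate the model's code with no builtins and only the names we allow
    return eval(llm_response, {"__builtins__": {}}, {"dataset": dataset})

dataset = FakeView()
view = build_view(dataset, 'dataset.sort_by("uniqueness").limit(10)')
print(view.ops)  # → [('sort_by', 'uniqueness'), ('limit', 10)]
```

In the real app, `dataset` would be the loaded FiftyOne dataset, and the allowed namespace would also include `F` (the `ViewField`).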
Garbage input → garbage output. To avoid this, we run validation to make sure that the user’s natural language query is sensible.
First, we use OpenAI’s moderation endpoint. Then we categorize any prompt into one of the following four cases:
1: Sensible and complete: the prompt can reasonably be translated into Python code for querying a dataset.
All images with dog detections
2: Sensible and incomplete: the prompt is reasonable, but cannot be converted into a DatasetView without additional information. For example, if we have two models with predictions on our data, then the following prompt, which just refers to “my model” is insufficient:
Retrieve my model’s incorrect predictions
3: Out of scope: we are building an application that generates queried views into computer vision datasets. While the underlying GPT-3.5 model is a general purpose LLM, our application should not turn into a disconnected ChatGPT session next to your dataset. Prompts like the following should be snuffed out:
Explain quantum computing like I’m five
4: Not sensible: given a random string, it would not make sense to attempt to generate a view of the dataset — where would one even start?!