Bootstrapping Labels with GPT-4


A cost-effective approach to data labeling

Data labeling is a critical component of machine learning projects, governed by the old adage, “garbage in, garbage out.” Labeling involves creating annotated datasets for training and evaluation, but this process can be time-consuming and expensive, especially for projects with lots of data. What if we could use the advances in LLMs to reduce the cost and effort involved in data labeling?

GPT-4 is a state-of-the-art language model developed by OpenAI. It has a remarkable ability to understand and generate human-like text and has been a game changer in the natural language processing (NLP) community and beyond. In this blog post, we’ll explore how you can use GPT-4 to bootstrap labels for various tasks, which can significantly reduce the time and cost of the labeling process. We’ll focus on sentiment classification to demonstrate how prompt engineering lets you create accurate and reliable labels with GPT-4, and how the same technique extends to more complex labeling tasks.

As in writing, editing is usually less strenuous than composing the original work. That’s why starting with pre-labeled data is more attractive than starting from a blank slate. GPT-4’s ability to understand context and generate human-like text makes it well suited as a prediction engine for pre-labeling data, reducing the manual effort required, cutting costs, and making the labeling process less mundane.

So how do we do this? If you’ve used GPT models, you’re probably familiar with prompts. Prompts set the context for the model before it begins generating output and can be tweaked and engineered (i.e. prompt engineering) to help the model deliver highly specific results. This means we can create prompts that GPT-4 can use to generate text that looks like model predictions. For our use case, we will craft our prompts in a way that guides the model toward producing the desired output format as well.

Let’s take a straightforward example of sentiment analysis. If we are trying to classify the sentiment of a given string of text as positive, negative, or neutral, we could provide a prompt like:

"Classify the sentiment of the following text as 'positive', 'negative', or 'neutral': <input_text>"

Once we have a well-structured prompt, we can use the OpenAI API to generate predictions from GPT-4. Here’s an example using Python:

import openai
import re

openai.api_key = "<your_api_key>"

def get_sentiment(input_text):
    # Ask for a constrained response so the label is easy to parse
    prompt = f"Respond in the json format: {{'response': sentiment_classification}}\nText: {input_text}\nSentiment (positive, neutral, negative):"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # swap in "gpt-4" if you have access; the call is identical
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_tokens=40,
        n=1,
        stop=None,
        temperature=0.5,
    )
    response_text = response.choices[0].message["content"].strip()
    # Pull the first sentiment keyword out of the model's response
    sentiment = re.search("negative|neutral|positive", response_text).group(0)
    # Add input_text back in for the result
    return {"text": input_text, "response": sentiment}

We can run this with a single example to inspect the output we’re receiving from the API.

# Test single example
sample_text = "I had a terrible time at the party last night!"
sentiment = get_sentiment(sample_text)
print("Result:\n", sentiment)

Result:
{'text': 'I had a terrible time at the party last night!', 'response': 'negative'}

Once we’re satisfied with our prompt and the results we’re getting, we can scale this up to our entire dataset. Here, we’ll assume a text file with one example per line.

import json

input_file_path = "input_texts.txt"
output_file_path = "output_responses.json"

with open(input_file_path, "r") as input_file, open(output_file_path, "w") as output_file:
    examples = []
    for line in input_file:
        text = line.strip()
        if text:
            # Classify each line, then convert it to Label Studio's import format
            examples.append(convert_ls_format(get_sentiment(text)))
    output_file.write(json.dumps(examples))
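
The convert_ls_format helper isn’t defined in this post, but its job is to wrap each prediction in Label Studio’s task import format. A minimal sketch, assuming the labeling configuration used in the next section (a $reviewText data field, a Text tag named my_text, and a Choices tag named sentiment), might look like:

def convert_ls_format(result):
    # Wrap a {"text": ..., "response": ...} record in Label Studio's
    # pre-annotation import format; the tag names here must match the
    # labeling configuration ("sentiment", "my_text", "$reviewText").
    return {
        "data": {"reviewText": result["text"]},
        "predictions": [
            {
                "result": [
                    {
                        "from_name": "sentiment",
                        "to_name": "my_text",
                        "type": "choices",
                        # Capitalize to match the template's choice values
                        "value": {"choices": [result["response"].capitalize()]},
                    }
                ]
            }
        ],
    }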

We can import the data with pre-labeled predictions into Label Studio and have reviewers verify or correct the labels. This approach significantly reduces the manual work required for data labeling, as human reviewers only need to validate or correct the model-generated labels rather than annotate the entire dataset from scratch. See our full example notebook here.

Note that, depending on OpenAI’s data usage policies and your account settings, information sent to their APIs may be retained or used to improve their models. It’s therefore important not to send protected or private data to these APIs for labeling if you don’t want to risk exposing that information more broadly.

Once we have our pre-labeled data ready, we will import it into a data labeling tool, such as Label Studio, for review. This section will guide you through setting up a Label Studio project, importing the pre-labeled data, and reviewing the annotations.

Figure 1: Reviewing Sentiment Classification in Label Studio. (Image by author, screenshot with Label Studio)

Step 1: Install and Launch Label Studio

First, you need to have Label Studio installed on your machine. You can install it using pip:

pip install label-studio

After installing Label Studio, launch it by running the following command:

label-studio

This will open Label Studio in your default web browser.

Step 2: Create a New Project

Click on “Create Project” and enter a project name, such as “Review Bootstrapped Labels.” Next, you need to define the labeling configuration. For sentiment analysis, we can use the built-in Sentiment Analysis Text Classification template.

These templates are configurable, so if we want to change any of the properties, it’s really straightforward. The default labeling configuration is shown below.

<View>
  <Header value="Choose text sentiment:"/>
  <Text name="my_text" value="$reviewText"/>
  <Choices name="sentiment" toName="my_text" choice="single" showInline="true">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
    <Choice value="Neutral"/>
  </Choices>
</View>

Click “Create” to finish setting up the project.

Step 3: Import Pre-labeled Data

To import the pre-labeled data, click the “Import” button, choose JSON as the file format, and select the pre-labeled data file generated earlier (e.g., “output_responses.json”). The data will be imported along with the pre-populated predictions.
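
For reference, each record in the imported JSON file would look something like the following (assuming the convert_ls_format sketch above):

{
  "data": {"reviewText": "I had a terrible time at the party last night!"},
  "predictions": [
    {
      "result": [
        {
          "from_name": "sentiment",
          "to_name": "my_text",
          "type": "choices",
          "value": {"choices": ["Negative"]}
        }
      ]
    }
  ]
}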

Step 4: Review and Update Labels

After importing the data, you can review the model-generated labels. The annotation interface will display the pre-labeled sentiment for each text sample, and reviewers can either accept or correct the suggested label.

You can improve quality further by having multiple annotators review each example.

By utilizing GPT-4-generated labels as a starting point, the review process becomes much more efficient, and reviewers can focus on validating or correcting the annotations rather than creating them from scratch.

Step 5: Export Labeled Data

Once the review process is complete, you can export the labeled data by clicking the “Export” button in the “Data Manager” tab. Choose the desired output format (e.g., JSON, CSV, or TSV), and save the labeled dataset for further use in your machine learning project.
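
For example, a reviewed JSON export can be loaded back into Python with a few lines. This is a minimal sketch, assuming the default JSON export format and a hypothetical file name:

import json

# Load the Label Studio JSON export (hypothetical file name)
with open("project-export.json", "r") as f:
    tasks = json.load(f)

# Pull (text, label) pairs from each task's first annotation
labeled = []
for task in tasks:
    annotations = task.get("annotations", [])
    if annotations and annotations[0]["result"]:
        label = annotations[0]["result"][0]["value"]["choices"][0]
        labeled.append((task["data"]["reviewText"], label))

print(f"Loaded {len(labeled)} labeled examples")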


