Getting Started with Weaviate: A Beginner’s Guide to Search with Vector Databases
By Leonie Monigatti | Jul 2023


How to use vector databases for semantic search, question answering, and generative search in Python with OpenAI and Weaviate

If you landed on this article, I assume you have been playing around with building an app with a large language model (LLM) and came across the term vector database.

The tool landscape around building apps with LLMs is growing rapidly, with tools such as LangChain or LlamaIndex gaining popularity.

In a recent article, I described how to get started with LangChain, and in this article, I want to continue exploring the LLM tool landscape by playing around with Weaviate.

Weaviate is an open-source vector database. It enables you to store data objects and vector embeddings and query them based on similarity measures.

Vector databases have been getting a lot of attention since LLMs entered the spotlight. Probably the most popular use case for vector databases in the context of LLMs is to “provide LLMs with long-term memory”.

If you need a refresher on the concept of vector databases, you might want to have a look at my previous article:

In this tutorial, we will walk through how to populate a Weaviate vector database with embeddings of your dataset. Then we will go over three different ways you can retrieve information from it:

  • Semantic (vector) search
  • Question answering
  • Generative search

To follow along in this tutorial, you will need to have the following:

  • Python 3 environment
  • OpenAI API key (or alternatively, an API key for Hugging Face, Cohere, or PaLM)

A note on the API key: In this tutorial, we will generate embeddings from text via an inference service (in this case, OpenAI). Depending on which inference service you use, make sure to check the provider’s pricing page to avoid unexpected costs. For example, the Ada model (version 2) used here costs $0.0001 per 1,000 tokens at the time of writing and resulted in less than 1 cent in inference costs for this tutorial.

You can run Weaviate either on your own instances (using Docker, Kubernetes, or Embedded Weaviate) or as a managed service using Weaviate Cloud Services (WCS). For this tutorial, we will run a Weaviate instance with WCS, as this is the recommended and most straightforward way.

How to Create a Cluster with Weaviate Cloud Services (WCS)

To be able to use the service, you first need to register with WCS.

Once you are registered, you can create a new Weaviate Cluster by clicking the “Create cluster” button.

Screenshot of Weaviate Cloud Services

For this tutorial, we will be using the free trial plan, which will provide you with a sandbox for 14 days. (You won’t have to add any payment information. Instead, the sandbox simply expires after the trial period. But you can create a new free trial sandbox anytime.)

Under the “Free sandbox” tab, make the following settings:

  1. Enter a cluster name
  2. Enable Authentication (set to “YES”)

Screenshot of Weaviate Cloud Services plans

Finally, click “Create” to create your sandbox instance.

How to Install Weaviate in Python

Last but not least, add the weaviate-client to your Python environment with pip

$ pip install weaviate-client

and import the library:

import weaviate

How to Access a Weaviate Cluster Through a Client

For the next step, you will need the following two pieces of information to access your cluster:

  • The cluster URL
  • Weaviate API key (under “Enabled — Authentication”)

Screenshot of Weaviate Cloud Services sandbox

Now, you can instantiate a Weaviate client to access your Weaviate cluster as follows.

auth_config = weaviate.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY")  # Replace with your Weaviate instance API key

# Instantiate the client
client = weaviate.Client(
    url="https://<your-sandbox-name>.weaviate.network",  # Replace with your Weaviate cluster URL
    auth_client_secret=auth_config,
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY",  # Replace with your OpenAI API key
    },
)

As you can see, we are using the OpenAI API key under additional_headers to access the embedding model later. If you are using a provider other than OpenAI, change the header key to whichever of the following applies: X-Cohere-Api-Key, X-HuggingFace-Api-Key, or X-Palm-Api-Key.

To check if everything is set up correctly, run:

client.is_ready()

If it returns True, you’re all set for the next steps.

Now, we’re ready to create a vector database in Weaviate and populate it with some data.

For this tutorial, we will use the first 100 rows of the 200,000+ Jeopardy Questions dataset [1] from Kaggle.

import pandas as pd

df = pd.read_csv("your_file_path.csv", nrows=100)

First few rows of the 200,000+ Jeopardy Questions dataset [1] from Kaggle.

A note on the number of tokens and related costs: In the following example, we will embed the columns “category”, “question”, and “answer” for the first 100 rows. Based on a calculation with the tiktoken library, this will result in roughly 3,000 tokens to embed, which amounts to about $0.0003 in inference costs with OpenAI’s Ada model (version 2) as of July 2023.
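If you want to reproduce this estimate for your own data, here is a minimal sketch using the tiktoken library. It assumes the cl100k_base encoding (used by OpenAI’s second-generation embedding models) and the column names from this dataset:

import tiktoken

# cl100k_base is the encoding used by OpenAI's Ada v2 embedding model
encoding = tiktoken.get_encoding("cl100k_base")

# Count the tokens in the columns we are going to embed
num_tokens = sum(
    len(encoding.encode(str(text)))
    for column in ["category", "question", "answer"]
    for text in df[column]
)

print(f"Tokens to embed: {num_tokens}")
print(f"Estimated cost: ${num_tokens / 1000 * 0.0001:.4f}")  # $0.0001 per 1,000 tokens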

Step 1: Create a Schema

First, we need to define the underlying data structure and some configurations:

  • class: What will the collection of objects in this vector space be called?
  • properties: The properties of an object, including the property name and data type. In the Pandas DataFrame analogy, these would be the columns of the DataFrame.
  • vectorizer: The model that generates the embeddings. For text objects, you would typically select one of the text2vec modules (text2vec-cohere, text2vec-huggingface, text2vec-openai, or text2vec-palm), depending on which provider you are using.
  • moduleConfig: Here, you can configure the details of the modules in use. For example, the vectorizer is a module for which you can specify which model and version to use.

class_obj = {
    # Class definition
    "class": "JeopardyQuestion",

    # Property definitions
    "properties": [
        {
            "name": "category",
            "dataType": ["text"],
        },
        {
            "name": "question",
            "dataType": ["text"],
        },
        {
            "name": "answer",
            "dataType": ["text"],
        },
    ],

    # Specify a vectorizer
    "vectorizer": "text2vec-openai",

    # Module settings
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": False,
            "model": "ada",
            "modelVersion": "002",
            "type": "text"
        },
    },
}

In the above schema, you can see that we will create a class called "JeopardyQuestion" with the three text properties "category", "question", and "answer". The vectorizer we are using is OpenAI’s Ada model (version 2). All properties will be vectorized, but not the class name ("vectorizeClassName": False). If you have properties you don’t want to embed, you can exclude them (see the docs and the sketch below).
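For illustration, excluding a property from vectorization works via a property-level moduleConfig with "skip": True. The "airDate" property below is hypothetical and not part of this tutorial’s schema; it is only shown to demonstrate the pattern:

# Hypothetical property definition that is stored but NOT embedded
unvectorized_property = {
    "name": "airDate",  # illustrative property, not used in this tutorial
    "dataType": ["text"],
    "moduleConfig": {
        "text2vec-openai": {
            "skip": True,  # exclude this property from vectorization
        }
    },
}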

Once you have defined the schema, you can create the class with the create_class() method.

client.schema.create_class(class_obj)
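Note that create_class() will complain if a class with the same name already exists in your instance. If you want to re-run this tutorial from scratch, you can first delete the existing class; be careful, as this also deletes all objects stored in it:

# Caution: deletes the "JeopardyQuestion" class and all objects stored in it
client.schema.delete_class("JeopardyQuestion")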

To check if the class has been created successfully, you can review its schema as follows:

client.schema.get("JeopardyQuestion")

The created schema looks as shown below:

{
    "class": "JeopardyQuestion",
    "invertedIndexConfig": {
        "bm25": {
            "b": 0.75,
            "k1": 1.2
        },
        "cleanupIntervalSeconds": 60,
        "stopwords": {
            "additions": null,
            "preset": "en",
            "removals": null
        }
    },
    "moduleConfig": {
        "text2vec-openai": {
            "model": "ada",
            "modelVersion": "002",
            "type": "text",
            "vectorizeClassName": false
        }
    },
    "properties": [
        {
            "dataType": [
                "text"
            ],
            "indexFilterable": true,
            "indexSearchable": true,
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": false,
                    "vectorizePropertyName": false
                }
            },
            "name": "category",
            "tokenization": "word"
        },
        {
            "dataType": [
                "text"
            ],
            "indexFilterable": true,
            "indexSearchable": true,
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": false,
                    "vectorizePropertyName": false
                }
            },
            "name": "question",
            "tokenization": "word"
        },
        {
            "dataType": [
                "text"
            ],
            "indexFilterable": true,
            "indexSearchable": true,
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": false,
                    "vectorizePropertyName": false
                }
            },
            "name": "answer",
            "tokenization": "word"
        }
    ],
    "replicationConfig": {
        "factor": 1
    },
    "shardingConfig": {
        "virtualPerPhysical": 128,
        "desiredCount": 1,
        "actualCount": 1,
        "desiredVirtualCount": 128,
        "actualVirtualCount": 128,
        "key": "_id",
        "strategy": "hash",
        "function": "murmur3"
    },
    "vectorIndexConfig": {
        "skip": false,
        "cleanupIntervalSeconds": 300,
        "maxConnections": 64,
        "efConstruction": 128,
        "ef": -1,
        "dynamicEfMin": 100,
        "dynamicEfMax": 500,
        "dynamicEfFactor": 8,
        "vectorCacheMaxObjects": 1000000000000,
        "flatSearchCutoff": 40000,
        "distance": "cosine",
        "pq": {
            "enabled": false,
            "bitCompression": false,
            "segments": 0,
            "centroids": 256,
            "encoder": {
                "type": "kmeans",
                "distribution": "log-normal"
            }
        }
    },
    "vectorIndexType": "hnsw",
    "vectorizer": "text2vec-openai"
}

Step 2: Import data into Weaviate

At this stage, the vector database has a schema but is still empty. So, let’s populate it with our dataset. This process is also called “upserting”.

We will upsert the data in batches of 200. If you paid attention, you know this isn’t necessary here because we only have 100 rows of data. But once you are ready to upsert larger amounts of data, you will want to do this in batches. That’s why I’ll leave the code for batching here:

from weaviate.util import generate_uuid5

with client.batch(
    batch_size=200,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:
    for _, row in df.iterrows():
        question_object = {
            "category": row.category,
            "question": row.question,
            "answer": row.answer,
        }
        batch.add_data_object(
            question_object,
            class_name="JeopardyQuestion",
            uuid=generate_uuid5(question_object)
        )

Although Weaviate would generate a universally unique identifier (UUID) automatically, we manually generate the UUID from the question_object with the generate_uuid5() function to avoid importing duplicate items.
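Since generate_uuid5() derives a deterministic UUID from the object’s content, re-importing the same object produces the same UUID instead of a new one. A quick sketch to illustrate this (the example object is made up):

from weaviate.util import generate_uuid5

example_object = {"category": "SCIENCE", "question": "This is H2O", "answer": "Water"}

# The same content always yields the same UUID
assert generate_uuid5(example_object) == generate_uuid5(example_object)
print(generate_uuid5(example_object))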

For a sanity check, you can review the number of imported objects with the following code snippet:

client.query.aggregate("JeopardyQuestion").with_meta_count().do()

{'data': {'Aggregate': {'JeopardyQuestion': [{'meta': {'count': 100}}]}}}

How to Query the Weaviate Vector Database

The most common operation you will perform with a vector database is retrieving objects. In Weaviate, you retrieve objects by querying the vector database with the get() function:

client.query.get(
    <Class>,
    [<properties>]
).<arguments>.do()

  • Class: specifies the name of the class of objects to be retrieved. Here: "JeopardyQuestion"
  • properties: specifies the properties of the objects to be retrieved. Here: one or more of "category", "question", and "answer".
  • arguments: specifies the search criteria for retrieving the objects, such as limits or aggregations. We will cover some of these in the following examples; a first sketch is shown below.
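For example, a simple semantic search over the imported objects could look like this with the v3 Python client; the query concept "animals" and the limit of two results are arbitrary choices for illustration:

import json

response = (
    client.query
    .get("JeopardyQuestion", ["category", "question", "answer"])
    .with_near_text({"concepts": ["animals"]})  # vector (semantic) search
    .with_limit(2)                              # return the two most similar objects
    .do()
)

print(json.dumps(response, indent=2))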


