As you can see, vector embeddings are pretty cool.
Let’s go back to our example and say we embed the content of every book in the library and store these embeddings in a vector database. Now, when you want to find a “children’s book with a main character that likes food”, your query is also embedded, and the books that are most similar to your query are returned, such as “The Very Hungry Caterpillar” or maybe “Goldilocks and the Three Bears”.
What are the use cases of vector databases?
Vector databases existed before the hype around Large Language Models (LLMs) started. Originally, they were used in recommendation systems because they can quickly find objects similar to a given query. More recently, because they can provide long-term memory to LLMs, they have also been used in question-answering applications.
If you could already guess that vector databases are probably a way to store vector embeddings before opening this article and just want to know how they work under the hood, then let’s get into the nitty-gritty and talk about algorithms.
How do vector databases work?
Vector databases can quickly retrieve objects similar to a query because the stored vectors have already been organized into an index ahead of time. The underlying concept is called Approximate Nearest Neighbor (ANN) search, which combines different algorithms for indexing and for calculating similarities.
As you can imagine, calculating the similarity between a query and every embedded object you have with a brute-force k-nearest neighbors (kNN) search can become time-consuming when you have millions of embeddings. With ANN, you trade some accuracy for speed and retrieve the approximately most similar objects to a query.
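To make the cost of the brute-force approach concrete, here is a minimal sketch of an exact kNN search in plain Python. The three-dimensional book embeddings and the function name `knn_search` are made up for illustration; real embeddings have hundreds or thousands of dimensions and come from an embedding model.

```python
import heapq
import math

def knn_search(query, embeddings, k=2):
    """Exact search: measure the distance to every stored vector, keep the k closest."""
    return heapq.nsmallest(
        k,
        embeddings,                                      # iterates over the titles
        key=lambda title: math.dist(query, embeddings[title]),
    )

# Hypothetical toy embeddings (assumed values, not produced by a real model)
books = {
    "The Very Hungry Caterpillar": [0.9, 0.8, 0.1],
    "Goldilocks and the Three Bears": [0.8, 0.6, 0.3],
    "A Brief History of Time": [0.1, 0.0, 0.9],
}

query = [1.0, 0.7, 0.0]  # stands in for the embedded search query
results = knn_search(query, books, k=2)
print(results)
```

Every query touches every stored vector, so the work grows linearly with the size of the collection. That is the cost ANN search tries to avoid.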
Indexing: To make ANN search possible, a vector database first indexes the vector embeddings. This step maps the vectors to a data structure that enables faster searching.
You can think of indexing as grouping the books in a library into different categories, such as author or genre. But because embeddings can hold more complex information, further categories could be “gender of the main character” or “main location of plot”. Indexing lets you search only a small portion of all the available vectors, which speeds up retrieval.
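The grouping idea can be sketched with a toy index that assigns each vector to its nearest "category" vector (a centroid) and, at query time, only scans the one matching group. This is a deliberately simplified illustration, not how any particular vector database implements indexing. The centroids are hand-picked here, and all names (`build_index`, `search`, `nearest`) are assumptions for the example.

```python
import math
from collections import defaultdict

def nearest(point, candidates):
    """Return the position of the candidate closest to point (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))

def build_index(vectors, centroids):
    """Group vector ids into buckets keyed by their nearest centroid."""
    buckets = defaultdict(list)
    for vec_id, vec in enumerate(vectors):
        buckets[nearest(vec, centroids)].append(vec_id)
    return buckets

def search(query, vectors, centroids, buckets, k=1):
    """Approximate search: scan only the bucket whose centroid is closest to the query."""
    bucket = buckets[nearest(query, centroids)]
    return sorted(bucket, key=lambda i: math.dist(query, vectors[i]))[:k]

vectors = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.45, 0.45]]
centroids = [[0.0, 0.0], [1.0, 1.0]]  # hand-picked "categories" for the toy example

buckets = build_index(vectors, centroids)
hits = search([0.9, 0.85], vectors, centroids, buckets, k=2)
print(dict(buckets), hits)
```

The query only compares against the two vectors in its bucket instead of all five. The price is accuracy: a query that sits near the border between two groups may miss a close neighbor stored in the other bucket.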
We will not go into the technical details of indexing algorithms, but if you are interested in further reading, you might want to start by looking up Hierarchical Navigable Small World (HNSW).
Similarity Measures: To find the nearest neighbors of the query among the indexed vectors, a vector database applies a similarity measure. Common choices include cosine similarity and dot product (where higher means more similar) as well as Euclidean distance, Manhattan distance, and Hamming distance (where lower means more similar).
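For reference, each of these measures fits in a line or two of plain Python. The sketch below uses the textbook definitions; the example vectors are arbitrary, and production systems use optimized implementations rather than code like this.

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their lengths
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

def euclidean_distance(a, b):
    # Straight-line distance
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # Sum of absolute differences per dimension
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming_distance(a, b):
    # Number of positions where the vectors differ (typically used for binary vectors)
    return sum(x != y for x, y in zip(a, b))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

print(dot_product(a, b))
print(cosine_similarity(a, b))
print(euclidean_distance(a, b))
print(manhattan_distance(a, b))
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))
```

Note that `a` and `b` point in the same direction, so their cosine similarity is (up to floating-point rounding) 1 even though their Euclidean distance is not 0. Which measure is appropriate depends on whether the length of the vectors should count.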
What is the advantage of vector databases over storing the vector embeddings in a NumPy array?
A question I have already come across often is: Can’t we just use NumPy arrays to store the embeddings? Of course you can, if you don’t have many embeddings or if you are just working on a fun hobby project. But as you can probably guess, vector databases are noticeably faster when you have a lot of embeddings, and you don’t have to hold everything in memory.
I’ll keep this short because Ethan Rosenthal has done a much better job explaining the difference between using a vector database vs. using a NumPy array than I could ever write.
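For a small collection, the NumPy approach might look like the sketch below. The embeddings are random placeholders rather than output from a real model, and the sizes (1,000 vectors of dimension 64) and the function name `top_k` are arbitrary assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Placeholder data: 1,000 random "embeddings" of dimension 64, scaled to unit length
embeddings = rng.normal(size=(1000, 64))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def top_k(query, matrix, k=5):
    """Brute-force search: score every row against the query, return the k best row ids."""
    query = query / np.linalg.norm(query)
    scores = matrix @ query          # cosine similarity, since all rows are unit length
    return np.argsort(-scores)[:k]   # ids sorted from most to least similar

# A query that is a slightly perturbed copy of row 42
query = embeddings[42] + 0.01 * rng.normal(size=64)
hits = top_k(query, embeddings, k=5)
print(hits)
```

This works well as long as the whole array fits in memory and a full scan per query is fast enough, which is exactly the limitation described above.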