In the last section, I explained two search algorithms that work on inverted index. These algorithms are **exact**, meaning that the `k`

best candidates they return are the true `k`

best candidates. Sounds trite? Well, this is a question we should ask ourselves whenever we design search algorithms on large-scale data, because in many scenarios it is not necessary to get the true `k`

best candidates.

Think about the Spotify example again: do you really care if the result of a search may miss a few people with similar taste as yours? Most people understand that in everyday applications (Google, Spotify, Twitter, etc.), the search is never exhaustive or exact. These applications are not mission critical enough to justify exact search. This is why the most widely used search algorithms are all approximate.

There are mainly two benefits of using approximate search algorithms:

- Faster. You can cut many corners if you no longer need exact results.
- Predicable resource consumption. This one is less obvious, but for several approximate algorithms, their resource usage (e.g., memory) can be configured
*a priori*independent of data distribution.

In this post, I write about the most widely used approximate index for Jaccard: Minwise Locality Sensitive Hashing (MinHash LSH).

## What is LSH?

Locality Sensitive Hashing indexes are true wonders in computer science. They are algorithmic magic powered by number theory. In machine learning literature, they are k-NN models, but unlike typical machine learning models, LSH indexes are data-agnostic so their accuracy conditioned on similarity can be determined *a priori* before ingesting new data points or changing data distribution. So, they are more similar to inverted indexes than a model.

An LSH index is essentially a set of hash tables each with a different hash function. Just like a typical hash table, an LSH index’s hash function takes a data point (e.g., a set, a feature vector, or an embedding vector) as input, and outputs a binary hash key. Except for this, they cannot be more different.

A typical hash function outputs keys that are pseudo-randomly and uniformly distributed over the entire key space for any input data. For example, MurmurHash is a well-known hash function that outputs near-uniformly and randomly over a 32-bit key space. This means that for any two inputs, such as `abcdefg`

and `abcefg`

, as long as they are different, their MurmurHash keys should not be correlated and should have the same probability to be any one of the keys in the entire 32-bit key space. This is a desired property of a hash function, because you want even distribution of keys over hash buckets to avoid chaining or constantly resizing your hash table.

An LSH’s hash function does something opposite: for a pair of similar inputs, with similarity defined through some metric space measure, their hash keys should be more likely to be equal, than another pair of hash keys of dis-similar inputs.

What does this mean? It means that an LSH hash function has higher probability of hash key collision for data points that are more similar. Effectively, we are utilizing this higher collision probability for similarity-based retrieval.

**MinHash LSH**

For every similarity/distance metric, there is an LSH hash function. For Jaccard, the function is called *Minwise Hash Function*, or *MinHash function*. Given an input set, a MinHash function consumes all elements with a random hash function and keeps track of the minimum hash value observed. You can build an LSH index using a single MinHash function. See the diagram below.

The mathematical theory behind MinHash function states that the probability of two sets having the same minimum hash value (i.e., hash key collision) is the same as their Jaccard.

It is a magical result, but the proof is quite simple.

A MinHash LSH index with a single MinHash function does not give you satisfactory accuracy because the collision probability is linearly proportional to the Jaccard. See the following plot to understand why.

Imagine we draw a threshold at Jaccard = 0.9: results with higher Jaccard than 0.9 with the query set is relevant, wheres results with lower than 0.9 Jaccard are irrelevant. In the context of search, the notion of “false positive” means that irrelevant results are returned, wheres the notion of “false negative” means that relevant results are not returned. Based on the plot above and looking at the area corresponding to false positive: if the index only uses a single MinHash function, it is going to produce false positives at a very high probability.

**Boosting the Accuracy of MinHash LSH**

This why we need another LSH magic: a process called *boosting. *We can boost the index to be much more attuned to the relevancy threshold specified.

Instead of only one, we use `m`

MinHash functions generated through a process called *Universal Hashing** — *basically `m`

random permutations of the same hash function of 32-bit or 64-bit integer. For every indexed set, we generate `m`

minimum hash values using universal hashing.

Imagine you list the `m`

minimum hash values for an indexed set. We group every `r`

number of hash values into a band of hash values, and we make `b`

such bands. This requires `m = b * r`

.

The probability that two sets having “band collision” — all the hash values in a band collide between two sets, or `r`

contiguous hash collisions, is `Jaccard(A, B)^r`

. That’s a lot smaller than a single hash value. However, the probability of having at least one “band collision” between two sets is `1 — (1-Jaccard(A, B)^r)^b`

.

Why do we care about `1 — (1-Jaccard(A, B)^r)^b`

? Because this function has a special shape:

In the plot above, you can see by using `m`

MinHash functions, the “at-least-one band collision” probability is an S-curve function with a steep rise around Jaccard = 0.9. Assuming the relevancy threshold is 0.9, the false positive probability of this index is much smaller than the index that uses only one random hash function.

Because of this, an LSH index always uses `b`

bands of `r`

MinHash functions to boost accuracy. Each band is a hash table storing pointers to indexed sets. During search, any indexed set collides with the query set on any band is returned.

To build an MinHash LSH index, we can specify *a prior* a relevancy threshold and acceptable false positive and negative probabilities conditioned on Jaccard similarity, and calculate the optimal `m`

, `b`

and `r`

, before indexing any data points. This is a great advantage of using LSH over other approximate indexes.

You can find my implementation of MinHash LSH in the Python package datasketch. It also has other MinHash-related algorithms like LSH Forest and Weighted MinHash.

I have covered a lot of topics in this post, but I barely scratched the surface of search indexes for Jaccard similarity. If you are interested in reading more about these topics, I have a list of further readings for you:

- Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman. The 3rd chapter goes into detail about MinHash and LSH. I think it is a great chapter for gaining the intuition of MinHash. Be aware the application described in the chapter is focused on n-gram based text matching.
- JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. The preliminary section of this paper explains the intuitation behind the
`search_top_k_merge_list`

and`search_top_k_probe_set`

algorithms. The main section explains how to take cost into consideration when input sets are large, such as a table column. - Datasketch and SetSimilaritySearch libraries respectively implement the state-of-the-art approximate and exact Jaccard similarity search indexes. The issues list of the datasketch project is a treasure trove of application scenarios and practical considerations when applying MinHash LSH.

## What about Embeddings?

In recent years, due to breakthroughs in representation learning using deep neural networks like Transformers, similarity between learned embedding vectors is meaningful when the input data is part of the same domain that the embedding model trained on. The main differences between that scenario and the search scenario described in this post are:

- Embedding vectors are dense vectors with typically 60 to 700 dimensions. Every dimension is non-zero. In contrast, sets, when represented as one-hot vectors are sparse: 10k to millions of dimensions, but most dimensions are zeros.
- Cosine similarity (or dot-product on normalized vectors) is typically used for embedding vectors. For sets we use Jaccard similarity.
- It is hard to specify a relevancy threshold on similarity between embedding vectors, because the vectors are learned black-box representations of the original data such as image or text. On the other hand, Jaccard similarity threshold for sets is much easier to specify because
**sets are the original data**.

Due to the above differences, it is not straightforward to compare embeddings and sets because they are distinctively different data types, even though you could classify both of them as high-dimensional. They are suitable for different application scenarios.