With the release of Llama 2, running strong LLMs locally has become increasingly feasible. Its accuracy approaches that of OpenAI’s GPT-3.5, which makes it a good fit for many use cases.
In this article, we will explore how to use Llama 2 for topic modeling without passing every single document to the model. Instead, we are going to leverage BERTopic, a modular topic-modeling technique that can use any LLM to fine-tune topic representations.
BERTopic is rather straightforward. It consists of five sequential steps:
- Embedding documents
- Reducing the dimensionality of the embeddings
- Clustering the reduced embeddings
- Tokenizing documents per cluster
- Extracting the best-representing words per cluster
However, with the rise of LLMs like Llama 2, we can do much better than a bunch of independent words per topic. Passing all documents directly to Llama 2 and having it analyze them is computationally infeasible, and while we could employ a vector database for search, we would not know which topics to search for in the first place.
Instead, we will leverage the clusters and topics created by BERTopic and have Llama 2 fine-tune and distill that information into more accurate representations. This gives us the best of both worlds: the topic creation of BERTopic combined with the topic representation of Llama 2.
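Concretely, "distilling" means sending Llama 2 only each topic's keywords and a handful of representative documents rather than the whole corpus. The sketch below shows one hypothetical way such a prompt could be assembled using Llama 2's chat template; the `build_prompt` helper, the template wording, and the sample topic are all illustrative, not BERTopic's actual API.

```python
# Hypothetical sketch: distill one BERTopic topic (keywords + a few
# representative documents) into a single Llama 2 labeling prompt.
system_prompt = (
    "You are a helpful assistant that labels topics based on "
    "keywords and example documents."
)

# Llama 2 chat-style template; {keywords} and {documents} are filled per topic.
prompt_template = """<s>[INST] <<SYS>>
{system}
<</SYS>>

I have a topic described by the following keywords: {keywords}.
Sample documents from this topic:
{documents}

Give a short label for this topic. [/INST]"""

def build_prompt(keywords, documents):
    """Fill the template with one topic's keywords and sample documents."""
    return prompt_template.format(
        system=system_prompt,
        keywords=", ".join(keywords),
        documents="\n".join(f"- {d}" for d in documents),
    )

prompt = build_prompt(
    ["meat", "beef", "eating", "emissions"],
    ["Eating less meat cuts emissions.", "Beef production is carbon-heavy."],
)
print(prompt)
```

With one such prompt per topic, the number of LLM calls scales with the number of topics, not the number of documents, which is what makes this approach computationally feasible.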
Now that this intro is out of the way, let’s start the hands-on tutorial!
We will start by installing a number of packages that we are going to use throughout this example:
```shell
pip install bertopic datasets accelerate bitsandbytes xformers adjustText
```
Keep in mind that you will need at least a T4 GPU in order to run this example, which can…