Build Industry-Specific LLMs Using Retrieval Augmented Generation | by Skanda Vivek | May, 2023

You can do the same thing with words or sentences, instead of pictures. Notice how in the above example, the vectorization is able to capture the semantic representation i.e. it knows that a sentence talking about a bird swooping in on a baby chipmunk should be in the (small, animal) quadrant, whereas the sentence talking about yesterday’s storm when a large tree fell on the road should be in the (big, tree) quadrant. In reality, there are more than 2 dimensions — 384 for sentence transformer embeddings (sentences or chunks of text are passed through the BERT transformer, and 384 numbers corresponding to the 384 dimensions are outputted.)

Vector Search + LLM (RAG)

Vector search is now commonplace in all search (Google, Bing, pick your favorite search engine). Think about if your entire search database has only these 4 documents. You can ask the question, “Tell me something about a big tree.”

The output will be:

During yesterday’s storm, a large tree fell on the road.

Just the most relevant document in the database. But if you asked an LLM the same question (and append in the 4 “documents” at the end), this is what you get (ChatGPT in this example):

A big tree is typically a large, towering plant with a sturdy trunk, branches, and an extensive network of roots. These trees can reach impressive heights, often extending well above the surrounding vegetation. They serve as essential components of many ecosystems, providing shade, shelter, and food for various organisms.

The incident you mentioned about a large tree falling on the road during a storm indicates the power and impact of severe weather conditions. Storms with strong winds can uproot or break trees, leading to road blockages and potential hazards for vehicles and pedestrians.

In another scenario you mentioned, while driving on the road in Bandipur, you encountered a large elephant crossing the road. Bandipur is a national park and wildlife sanctuary located in India, known for its rich biodiversity and diverse animal population. Elephants are one of the prominent species found in Bandipur and are known to traverse the area, sometimes crossing roads. Encounters like these highlight the importance of wildlife conservation and the need for humans to coexist with animals in their natural habitats.

Switching topics to Bonsai, they are miniature trees that are grown in small pots and meticulously shaped and pruned to maintain their small size. Bonsai cultivation is an ancient art form that originated in East Asia and has gained popularity worldwide. Due to their limited root space, Bonsai trees require regular watering to prevent them from drying out. The small pots can cause the soil to dry quickly, so it’s crucial to monitor the moisture levels and water them accordingly.

Finally, you mentioned observing a bird swooping in on a baby chipmunk through your window. This observation reflects a moment of nature in action, where a bird might have been hunting for food. Birds of prey and certain bird species, such as hawks or owls, are known to swoop down on small mammals like chipmunks as part of their natural hunting behavior. It’s fascinating to witness such interactions, as they provide insights into the dynamics of the animal kingdom.

Overall, these snippets of nature — whether they involve big trees, elephants, Bonsai trees, or bird predation — highlight the diversity and wonder of the natural world that surrounds us.

That answer seemed much more well rounded. ChatGPT was using its internal knowledge from being trained on trillions of documents, in combination with the context it was given.

But let’s say you have more than just 4 documents — maybe thousands or millions, that you can’t fit into the ChatGPT prompt. In this case, you can use vector search to narrow down the context most likely to contain the answer, append it to the prompt, and ask the same question as follows:

This is the (truncated) answer it now gives:

ChatGPT answer | Skanda Vivek

You could then have a database, that stores documents and embeddings. You can have another DB that stores queries, and finds the most relevant documents based on the queries:

Document DB (Left) and Quey DB (Right) | Skanda Vivek

Once you have the most similar document(s) by query, you can feed that to any LLM like ChatGPT. By this simple trick, you have augmented your LLM using document retrieval! This is also known as retrieval augmented generation (RAG).

Source link

Leave a Comment