How to add Domain-Specific Knowledge to an LLM Based on Your Data | by Antoine Villatte | Jul, 2023


If you don’t have an LLM installed on your computer, this step-by-step guide explains how to set one up:
https://medium.com/better-programming/how-to-run-your-personal-chatgpt-like-model-locally-505c093924bc

You can find the full code to build the vector index database in this repo:
https://github.com/Anvil-Late/knowledge_llm/tree/main

Broadly speaking, in the src folder:

  • parse.py creates the PEP corpus
  • embed.py creates the embedded corpus
  • The Qdrant vector index database runs from its Docker image: pull it with docker pull qdrant/qdrant, then start it with
    docker run -d -p 6333:6333 qdrant/qdrant
  • create_index.py creates and populates the vector index database
  • query_index.py embeds a query and retrieves the most relevant documentation
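
To make the retrieval step more concrete, here is a minimal, self-contained sketch of what "embed a query and retrieve the most relevant documentation" means conceptually. It is not the code from the repository: the bag-of-words embed function, the tiny corpus, and the search helper are hypothetical stand-ins for the real embedding model and the Qdrant index, chosen only so the example runs without any dependency.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, corpus, top_k=1):
    """Return the top_k (score, passage) pairs most similar to the query."""
    query_vector = embed(query)
    scored = [(cosine_similarity(query_vector, embed(p)), p) for p in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical mini-corpus, for illustration only
corpus = [
    "type hints annotate function arguments and return values",
    "the walrus operator assigns values inside expressions",
    "f-strings provide formatted string literals",
]

best = search("how do I annotate function arguments", corpus, top_k=1)
print(best[0][1])
```

A real embedding model captures meaning rather than exact word overlap, and a vector database makes the search fast on large corpora, but the principle is the same: embed the query, score it against the embedded passages, and keep the best matches.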

If you need more details, you can find my step-by-step guide here:
https://betterprogramming.pub/efficiently-navigate-massive-documentations-ai-powered-natural-language-queries-for-knowledge-372f4711a7c8

First, we’ll write a script that generates a prompt for the LLM:

import os
from query_index import DocSearch
import logging
import re
from utils.parse_tools import remove_tabbed_lines
logging.disable(logging.INFO)

def set_global_logging_level(level=logging.ERROR, prefices=[""]):
    """
    Override logging levels of different modules based on their name as a prefix.
    It needs to be invoked after the modules have been loaded so that their loggers have been initialized.

    Args:
        - level: desired level. e.g. logging.INFO. Optional. Default is logging.ERROR
        - prefices: list of one or more str prefices to match (e.g. ["transformers", "torch"]). Optional.
          Default is `[""]` to match all active loggers.
          The match is a case-sensitive `module_name.startswith(prefix)`
    """
    prefix_re = re.compile(fr'^(?:{ "|".join(prefices) })')
    for name in logging.root.manager.loggerDict:
        if re.match(prefix_re, name):
            logging.getLogger(name).setLevel(level)

def main(
    query,
    embedder = "instructor",
    top_k = None,
    block_types = None,
    score = False,
    open_url = True,
    print_output = True
):

    # Set up query
    query_machine = DocSearch(
        embedder=embedder,
        top_k=top_k,
        block_types=block_types,
        score=score,
        open_url=open_url,
        print_output=print_output
    )

    # Retrieve the most relevant documentation for the query
    query_output = query_machine(query)

    # Generate prompt (the text stays unindented so no extra whitespace ends up in the prompt)
    prompt = f"""
Below is relevant documentation and a query. Write a response that appropriately completes the query based on the relevant documentation provided.

Relevant documentation: {remove_tabbed_lines(query_output)}

Query: {query}

Response: Here's the answer to your query:"""

    # The prompt is printed so that the calling shell script can capture it
    print(prompt)
    return prompt

if __name__ == '__main__':
    set_global_logging_level(logging.ERROR, ["transformers", "nlp", "torch", "tensorflow", "tensorboard", "wandb"])
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--query', type=str, default=None)
    parser.add_argument('--top_k', type=int, default=5)
    parser.add_argument('--block_types', type=str, default='text')
    # Boolean options are flags: type=bool would turn any non-empty string, even "False", into True
    parser.add_argument('--score', action='store_true')
    parser.add_argument('--open_url', action='store_true')
    parser.add_argument('--embedder', type=str, default='instructor')
    parser.add_argument('--print_output', action='store_true')
    args = parser.parse_args()
    main(**vars(args))

Both logging.disable(logging.INFO) and set_global_logging_level keep the script quiet during execution. This matters because everything the script prints is captured and used as the prompt, so any stray log message would end up in it.
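
To see the prefix-based silencing in isolation, here is a small sketch you can run without any machine learning library installed. It is an illustration under assumptions, not the script's exact function: set_level_by_prefix is a hypothetical simplified variant (it escapes the prefixes and iterates over a copy of the logger registry), and the two logger names are made up for the demonstration.

```python
import logging
import re

def set_level_by_prefix(level, prefixes):
    """Set `level` on every registered logger whose name starts with one of `prefixes`."""
    pattern = re.compile("^(?:" + "|".join(re.escape(p) for p in prefixes) + ")")
    # Iterate over a copy of the names so the registry can be updated safely
    for name in list(logging.root.manager.loggerDict):
        if pattern.match(name):
            logging.getLogger(name).setLevel(level)

# Hypothetical logger names, created only for this demonstration
noisy = logging.getLogger("transformers.modeling_utils")
other = logging.getLogger("my_app.core")
noisy.setLevel(logging.INFO)
other.setLevel(logging.INFO)

set_level_by_prefix(logging.ERROR, ["transformers", "torch"])

print(noisy.level == logging.ERROR)  # the matching logger was raised to ERROR
print(other.level == logging.INFO)   # the non-matching logger is untouched
```

Only loggers that already exist when the function runs are affected, which is why the script calls it after the heavy modules have been imported.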

We then combine prompt generation and prompt injection in the following bash script:

#!/bin/bash

# Get the query from the command-line argument
query="$1"

# Launch prompt generation script with argument --query
if ! prompt=$(python src/query_llm.py --query "$query" --top_k 1); then
    echo "Error running query_llm.py"
    exit 1
fi

# Run llama.cpp with the generated prompt
<PATH_TO_LLAMA.CPP>/main \
    -t 8 \
    -m <PATH_TO_LLAMA.CPP>/models/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin \
    --color \
    -c 4000 \
    --temp 0.7 \
    --repeat_penalty 1.1 \
    -n -1 \
    -p "$prompt" \
    -ngl 1

Here, the prompt generation script prints the prompt and the bash script captures it in the $prompt variable. That variable is then passed to the llama.cpp ./main command through the -p (or --prompt) parameter.

The LLM will then take over and complete the prompt starting from ‘Response: Here’s the answer to your query:’.
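
If you prefer to stay in Python, the same capture-and-inject mechanism can be sketched with the subprocess module. This is only an illustration under assumptions: capture_stdout and build_llama_command are hypothetical helper names, the paths are placeholders, and a one-line child process stands in for src/query_llm.py so the example runs anywhere. The flags simply mirror the ones used in the bash script.

```python
import subprocess
import sys

def capture_stdout(command):
    """Run a command and return everything it printed to stdout, like $(...) in bash."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    # Bash command substitution strips trailing newlines; mimic that here
    return result.stdout.rstrip("\n")

def build_llama_command(prompt, llama_dir, model_file):
    """Assemble the llama.cpp argument list with the same flags as the bash script."""
    return [
        f"{llama_dir}/main",
        "-t", "8",
        "-m", f"{llama_dir}/models/{model_file}",
        "--color",
        "-c", "4000",
        "--temp", "0.7",
        "--repeat_penalty", "1.1",
        "-n", "-1",
        "-p", prompt,
        "-ngl", "1",
    ]

# Stand-in for `python src/query_llm.py --query ...`: a child process that prints a prompt
prompt = capture_stdout([sys.executable, "-c", "print('Query: what is a PEP?')"])

# Placeholder path and model name; replace them with your own before launching
command = build_llama_command(prompt, "/path/to/llama.cpp", "model.bin")
print(command[command.index("-p") + 1])
```

Passing the resulting list to subprocess.run(command) would then launch llama.cpp, and because the arguments are given as a list, the prompt needs no shell quoting.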

Remember to replace <PATH_TO_LLAMA.CPP> with the path of your llama.cpp clone on your computer, and Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin with your own LLM. Personally, I chose this one because it gave me pretty good results and it is not under a restrictive license, but feel free to try other models!

Let’s recap what we have accomplished here:

Throughout this article, we explored an effective strategy for augmenting Large Language Models (LLMs) with domain knowledge. LLMs perform remarkably well on a wide variety of tasks, but they often struggle in highly specialized domains that require precise knowledge and nuanced understanding.

To address this limitation, we supplied the LLM with domain-specific documentation at query time. We built a vector index database from the documentation, which gave us a foundation for efficient semantic similarity search. This let us identify the most relevant pieces of documentation for a given query and inject them as context into the prompt of a local LLM.

We illustrated the approach with Python Enhancement Proposals (PEPs) as a representative dataset, but the method applies to any form of documentation. The code snippets and repository provided in this article serve as a practical demonstration of the implementation.

By following these steps, you can improve LLM performance in specific professional contexts, helping the model handle industry-specific jargon and generate more accurate responses. Moreover, because the strategy relies on open-source technologies that run locally, the whole process can be executed on your own machine without external internet dependencies, which safeguards privacy and confidentiality.

In conclusion, giving LLMs access to domain knowledge helps them handle specialized tasks, because they receive the context they need to answer. The approach extends across diverse domains and lets LLMs provide assistance and insights tailored to specific professional requirements.

If you have any questions, don’t hesitate to leave them in the comments. I’ll do my best to answer!


