Build a simple RAG chatbot with LangChain...

In this blog post, I will guide you through the process of creating a unique RAG (Retrieval Augmented Generation) chatbot. Unlike typical chatbots, this one is specifically designed for handling queries related to very specific topics or articles. By utilising the RAG technique, our chatbot can generate responses to complex queries that standard chatbots might find challenging.

Let’s narrow our focus to a topic that typically falls outside the capabilities of a generic chatbot. We are delving into a niche area, such as providing horoscope predictions for individuals born under the Sagittarius sign in the year 2024. Imagine having a chatbot that can answer questions from Sagittarius individuals about what the year 2024 holds for them. While this may seem highly specific, it serves as an example of the type of specialised chatbot we aim to create. With our target set on a fortune-telling chatbot for Sagittarius folks, let’s dive into the tools and techniques we’ll employ to create this bot.
What are LLMs and RAG?

I believe anyone involved in the tech world has come across the term LLM. With the rise of generative AI, “LLM” has turned into a key term for many developers, particularly those who are interested in or are currently working in the AI field. But what exactly is LLM?

LLM

Large Language Models (LLMs) form a specific category within the broader field of Natural Language Processing (NLP). These models specialise in generating text by analysing and processing vast datasets. Their notable strength lies in their capacity to comprehend and generate language in a broad and versatile manner. LLMs use something called the transformer model. The transformer model is a neural network that learns context and semantic meaning in sequential data like text.

A well-known example of a chatbot using LLM technology is ChatGPT, which incorporates the GPT-3.5 and GPT-4 models.

As for this blog, we will be using a model from MistralAI called Mixtral8x7b. This model is capable of matching or surpassing the performance of Llama 70B and GPT-3.5 and it is available for free use.
Problems with Generic LLM

When it comes to Large Language Models (LLMs), there are two possible scenarios involving topics that they may be less knowledgeable about.

Firstly, the model may straightforwardly admit that it lacks information on a particular subject because it hasn’t been trained on that specific data.

Secondly, there’s the potential for what’s known as “hallucination”, where the model generates responses that are inaccurate or misleading due to its lack of specialised knowledge. This is because generic LLMs are not trained with detailed information in certain areas, such as specific legal rules or medical data, which typically fall outside the scope of a general-purpose LLM’s training data.

To address this issue, one method is to fine-tune the model by adding specific data to it and tailor it for particular needs. However, this blog will focus on a simpler approach called RAG, or Retrieval-Augmented Generation.
Introducing RAG

RAG, short for Retrieval-Augmented Generation, is a way to boost what Large Language Models (LLMs) know by adding more data to them. It’s made up of two main components:

Indexing: This is about taking in data from various sources and organising it into a structure the system can search efficiently.
Retrieval and Generation: To delve deeper into how RAG functions, let’s understand its two primary processes: retrieval and generation. The retrieval component acts like a focused search engine, scanning a database of indexed information to find relevant data related to the user’s query. This data is then fed into the Large Language Model. The model uses this context, along with its trained knowledge base, to generate a response that’s more informed and accurate. This synergistic process allows RAG to provide more precise answers by supplementing its extensive but generalised training with specific, targeted information.

In simpler terms, RAG helps LLMs be more knowledgeable by pulling in extra information when needed, so they can answer questions better. At a high level, the chatbot's architecture follows this retrieve-then-generate flow.
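Before reaching for any tools, here is a minimal sketch of that flow in plain Python. The names used here (vector_db.search, llm.generate) are placeholders to illustrate the idea, not real LangChain APIs; the actual implementation follows later in this post.

# Conceptual sketch only; the real implementation uses LangChain, Pinecone and Mixtral
def answer_with_rag(question, vector_db, llm):
    # 1. Retrieval: find the stored chunks most similar to the question
    context_chunks = vector_db.search(question, top_k=3)
    # 2. Augmentation: combine the retrieved context with the user's question
    prompt = f"Context: {context_chunks}\n\nQuestion: {question}\nAnswer:"
    # 3. Generation: let the LLM answer using that extra context
    return llm.generate(prompt)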

Useful tools
LangChain
LangChain is an open-source framework written in Python and JavaScript, designed for building applications centred around language models. LangChain provides components that allow non-AI experts to implement existing AI language models in their applications. The framework is versatile and supports various tasks such as text summarisation, tagging, and others. However, this blog will specifically concentrate on creating a RAG pipeline.

Hugging Face
Hugging Face is an open-source platform focused on data science and machine learning. It lets users share their machine learning models. On Hugging Face, you can find a variety of machine learning pre-trained models, including those for natural language processing (NLP), computer vision, and more. Many of these models are equipped with inference capabilities, allowing you to integrate them into your applications to generate results.

Pinecone: Vector database
Before we delve into Pinecone, let’s clarify what a vector database is. A vector database, such as Pinecone, stores data in the form of vectors, which are arrays of numbers, e.g. [0.1, 3.21, -1.3, 9.2, …]. This approach allows for efficient similarity searches, as it groups similar data together and enables models or applications to retrieve relevant information effectively.

Pinecone is a cloud-based vector database optimised for machine learning applications. It is designed to efficiently store and retrieve dense vector embeddings, making it ideal for enhancing Large Language Models (LLMs) with long-term memory and improving their performance in tasks such as natural language processing. It offers quick data retrieval, ideal for chatbots, and includes a free tier for storing up to 100,000 vectors. Although there are open-source vector databases available like Chroma, Weaviate, and Milvus, Pinecone is preferred for its simplicity and ease of use.
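To make "similarity search" more concrete, here is a tiny, self-contained illustration of comparing two embedding vectors with cosine similarity. This is only to show the idea; Pinecone handles this at scale for you.

import math

# Toy example: cosine similarity between two small embedding vectors
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.1, 3.21, -1.3, 9.2]
doc_vec = [0.2, 3.1, -1.1, 8.9]
print(cosine_similarity(query_vec, doc_vec))  # close to 1.0 means very similar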

Let’s build something!
Setting up

Before implementing the code, make sure you have set up the following:

Hugging Face account setup
1. To create a Hugging Face account, go to this link: Hugging Face, and sign up.

2. After signing up, go to your Profile page, click on Edit Profile, and go to Access Tokens.

3. On the Access Tokens page, create a new token called “RAG-Chatbot”, or similar. Make sure no one has access to this token except you.

Pinecone account setup

1. To create a Pinecone account, sign up via this link: https://www.pinecone.io/

2. After registering with the free tier, go into the project and click on Create a Project.

Project structure and environment

After completing the account setup, create a directory called “Chatbot”. Inside the Chatbot directory, create a file called .env. The contents of .env should look like the snippet below (replace the xxxxx values with your Hugging Face access token and Pinecone API key). We will use these keys to authenticate with Pinecone and Hugging Face.

.env file

PINECONE_API_KEY=xxxxxx
HUGGINGFACE_API_KEY=xxxxx
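Later on, these values will be read with python-dotenv, roughly like this (a small preview of code we will add to the ChatBot class):

# How the keys will be read at runtime
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project directory
pinecone_key = os.getenv('PINECONE_API_KEY')
hf_key = os.getenv('HUGGINGFACE_API_KEY')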

Finally, create a file called main.py, and create an empty class called ChatBot inside it. This class will be called when we implement the frontend UI. For now, just create the class without adding any more code to it.
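At this point, main.py only needs the skeleton, something like:

# main.py: an empty ChatBot class, to be filled in throughout this post
class ChatBot():
    pass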

Importing dependencies

Here is the list of dependencies you should install prior to the implementation:

langchain : to import the components and chains from the LangChain library.

pinecone-client : to connect to the Pinecone cloud vector database.

streamlit : used for creating the UI pages with Python code.

python-dotenv : to load the environment variables stored in the .env file.

You can store this list inside requirements.txt as shown below (pinning the dependency versions is optional):

langchain==0.1.1
pinecone-client==2.2.4
python-dotenv==1.0.0
streamlit==1.29.0

After you get your requirements.txt inside your project directory, install the dependencies using this command:

pip install -r requirements.txt

Data indexing

Before we start indexing, we first need the data that our model will use to answer questions. In this case, we need a text file or a PDF about the horoscope for Sagittarius in 2024, since our chatbot will be answering questions about this. Here, we take text from a horoscope website and store it in a horoscope.txt file inside our project directory. Feel free to use any blog or article, just make sure it provides the right context.

Having gathered the textual content for our RAG application, it’s time to move on to the data indexing phase. First, we break the text file into manageable segments using a text splitter, where we define the size of those segments. In this example, we set chunk_size to 1000 and chunk_overlap to 4.

Next, we introduce an embedding utility, specifically the HuggingFaceEmbeddings tool, which will turn each text segment into a vector.

from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain.embeddings import HuggingFaceEmbeddings

# Load the horoscope text file
loader = TextLoader('./horoscope.txt')
documents = loader.load()

# Split the text into ~1000-character chunks with a small overlap
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=4)
docs = text_splitter.split_documents(documents)

# Embedding model used to turn each chunk into a vector
embeddings = HuggingFaceEmbeddings()

Following the embedding process, the next step involves depositing these embedded text fragments into our vector database, Pinecone, for efficient storage and retrieval.

First, we initialise the Pinecone client in our application using the API key from the Pinecone dashboard. Then, we assign an index name and check if it already exists in Pinecone. If it does exist, we link it to the docsearch variable. If not, we create a new index using pinecone.create_index, with cosine as the metric and a dimension of 768, suitable for HuggingFace embeddings.

import os

from langchain.vectorstores import Pinecone
import pinecone

# Initialize Pinecone client
pinecone.init(
    api_key=os.getenv('PINECONE_API_KEY'),
    environment='gcp-starter'
)

# Define the index name
index_name = "langchain-demo"

# Check whether the index already exists
if index_name not in pinecone.list_indexes():
    # Create a new index and upload the embedded documents
    pinecone.create_index(name=index_name, metric="cosine", dimension=768)
    docsearch = Pinecone.from_documents(docs, embeddings, index_name=index_name)
else:
    # Link to the existing index
    docsearch = Pinecone.from_existing_index(index_name, embeddings)
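Before wiring up the model, you can optionally sanity-check the index with a quick similarity search. This is just a check, not part of the final class:

# Optional: verify that relevant chunks come back for a sample query
query = "What does 2024 hold for Sagittarius careers?"
for doc in docsearch.similarity_search(query, k=2):
    print(doc.page_content[:200])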

Model setup

Now that we have our embedded texts in the vector database, let’s move on to the model setup. Of course, we don’t want to create, train, and deploy the LLM from scratch locally. This is why we are using HuggingFaceHub, which lets us connect to and call a hosted model without having to deploy it on our own machine.

With HuggingFaceHub, we just define the ID of the model we want to use - in this case, mistralai/Mixtral-8x7B-Instruct-v0.1. We can also define:

temperature: which controls the randomness in the output
top_k: limits the number of highest probability next words to k

Don’t forget to add the huggingfacehub_api_token from the Hugging Face dashboard.

from langchain.llms import HuggingFaceHub

# Define the repo ID and connect to the Mixtral model on Hugging Face
repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
llm = HuggingFaceHub(
    repo_id=repo_id,
    model_kwargs={"temperature": 0.8, "top_k": 50},
    huggingfacehub_api_token=os.getenv('HUGGINGFACE_API_KEY')
)
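As an optional sanity check that the connection and token work, you can call the model directly before building the chain:

# Optional: confirm the hosted model responds before chaining everything together
print(llm.invoke("Say hello in one short sentence."))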

Prompt engineering

For the LLM to answer our question, we need to define a prompt that contains all of the necessary information. This allows us to customise the model to fit our needs. In our case, we will tell the model to be a fortune teller and to answer only relevant questions. Additionally, we need to pass {context} and {question} into the prompt: at run time, {context} is replaced with the chunks retrieved from our vector database, and {question} with the question the user asked.

With this template created, we then define a PromptTemplate object, passing our template and input variables (context and question) as parameters.

from langchain import PromptTemplate

template = """
You are a fortune teller. The human will ask you questions about their life.
Use the following piece of context to answer the question.
If you don't know the answer, just say you don't know.
Keep the answer within 2 sentences and concise.

Context: {context}
Question: {question}
Answer:
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)
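If you want to see exactly what the model will receive, you can render the template with dummy values (purely illustrative):

# Illustrative only: fill the template with dummy values and print the result
print(prompt.format(
    context="Sagittarius can expect steady career growth in early 2024.",
    question="How will my career go in 2024?"
))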

Chaining it all together

Now that we have:

Pinecone database index object (docsearch)
PromptTemplate (prompt)
Model (llm)

We are ready to chain them together. The process starts with docsearch pulling relevant documents to provide context. The user's query passes through unchanged via RunnablePassthrough. The prompt step then combines the retrieved context and the question into the final prompt, which is processed by our model, llm. Finally, the model's response is turned into plain text with StrOutputParser.

from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

rag_chain = (
    {"context": docsearch.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Finalising model

Now that we have our rag_chain ready, let’s put it into our previously created ChatBot() class. Simply move all of the code we have written so far into the ChatBot class.

# Import dependencies here
from dotenv import load_dotenv

class ChatBot():
    load_dotenv()
    loader = TextLoader('./horoscope.txt')
    documents = loader.load()

    # The rest of the code here

    rag_chain = (
        {"context": docsearch.as_retriever(), "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

Now that we have the ChatBot ready, we can test it out. In main.py, add the following code at the end (this is just for testing purposes and should be removed later):

# Outside the ChatBot() class
bot = ChatBot()
input = input("Ask me anything: ")
result = bot.rag_chain.invoke(input)
print(result)

Here is what I got when I ran main.py and asked questions about my life in 2024 (as a Sagittarius).
Testing the Model
Streamlit frontend

Here is how I implemented the model with a Streamlit frontend. As this blog is about the RAG LLM chatbot, I won’t go deep into the frontend side, but below is the Streamlit template I use for the chat UI, along with some basic functions for calling the model. Make sure to put this code in a separate file; in this example, it lives in streamlit.py.

from main import ChatBot
import streamlit as st

bot = ChatBot()

st.set_page_config(page_title="Random Fortune Telling Bot")
with st.sidebar:
    st.title('Random Fortune Telling Bot')

# Function for generating LLM response
def generate_response(input):
    result = bot.rag_chain.invoke(input)
    return result

# Store LLM generated responses
if "messages" not in st.session_state.keys():
    st.session_state.messages = [{"role": "assistant", "content": "Welcome, let's unveil your future"}]

# Display chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

# User-provided prompt
if input := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": input})
    with st.chat_message("user"):
        st.write(input)

# Generate a new response if the last message is not from the assistant
if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        with st.spinner("Getting your answer from mystery stuff.."):
            response = generate_response(input)
            st.write(response)
    message = {"role": "assistant", "content": response}
    st.session_state.messages.append(message)

After completing all the steps, you should be able to create something like this by running:

streamlit run streamlit.py

As you can see, the bot can answer (or fortune tell) any question related to my life in 2024 using the Horoscope text I got from online articles. It also does not answer irrelevant questions, thanks to the prompt template we have used.
Conclusion

As you can see, the model isn’t perfect and there are still many things to add and improve in the future. However, this should give you a basic understanding of how to create a RAG chatbot and how vector databases work.

You can take a look at the code in my GitHub repo.

Feel free to connect with me on LinkedIn and share your thoughts!
