This is the document retrieval API for Kakawa, a product-support bot that answers questions about any set of documentation. It can run online with cutting-edge AI models or offline on regular hardware, and it can be integrated with existing platforms such as WhatsApp.
This service leverages LlamaIndex to perform efficient Retrieval-Augmented Generation (RAG) queries, a state-of-the-art technique for using AI to answer questions from pre-loaded documents.
Evaluation-driven development is enabled by the RAGAS automatic evaluation framework, alongside the LangFuse dashboard for monitoring and manual evaluation.
The API is designed to work with both local and cloud-based models for generating embeddings and responses, ensuring adaptability to different environments. Locally, it supports Ollama for calling Large Language Models (LLMs) and fastembed for embeddings. For cloud services, it integrates with OpenAI, including advanced models like GPT-3.5 Turbo and GPT-4 for fast and robust language understanding, as well as OpenAI's and Cohere's embedding models.
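For illustration, model selection could be wired up roughly like this (a minimal sketch assuming the llama-index OpenAI, Ollama and FastEmbed integrations are installed; the model names are placeholders, not necessarily the ones this project uses):

```python
import os

from llama_index.core import Settings

if os.getenv("OPENAI_API_KEY"):
    # Cloud: OpenAI LLM and embeddings
    from llama_index.embeddings.openai import OpenAIEmbedding
    from llama_index.llms.openai import OpenAI

    Settings.llm = OpenAI(model="gpt-4")
    Settings.embed_model = OpenAIEmbedding()
else:
    # Offline: Ollama LLM and fastembed embeddings
    from llama_index.embeddings.fastembed import FastEmbedEmbedding
    from llama_index.llms.ollama import Ollama

    Settings.llm = Ollama(model="mistral", request_timeout=120.0)
    Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")
```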
RAGAS evaluations can run both online and offline, although running online with advanced models such as GPT-4 is highly recommended for better results.
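A minimal evaluation sketch, assuming the classic ragas `evaluate` API (column names and metrics have changed across ragas versions, and the default judge models are OpenAI's, which is why GPT-4 access improves results):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Toy sample; real runs would use generated or curated evaluation data.
samples = Dataset.from_dict({
    "question": ["How do I export my data?"],
    "answer": ["Use the export option in the settings menu."],
    "contexts": [["Data can be exported from Settings > Export."]],
})

print(evaluate(samples, metrics=[faithfulness, answer_relevancy]))
```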
The monitoring dashboard, LangFuse, can be self-hosted and runs well offline. A free-to-use cloud-hosted version is also available.
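Tracing can be plugged into LlamaIndex as a callback; the sketch below assumes the langfuse Python SDK's v2-style LlamaIndex integration and the `LANGFUSE_*` keys set in the environment:

```python
from langfuse.llama_index import LlamaIndexCallbackHandler
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager

# The handler reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST
# from the environment and sends a trace for every LlamaIndex operation.
Settings.callback_manager = CallbackManager([LlamaIndexCallbackHandler()])
```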
- LlamaIndex: Sets up the OpenAI or Ollama model and integrates with LangFuse for logging and RAGAS for evaluation.
- Embedding Model: Selects and configures the embedding model.
- Vector Store: Initializes a Qdrant vector store to hold the documents and their embeddings.
- Indexing and Querying: Builds an index from the documents and creates a query engine for document retrieval (a wiring sketch follows this list).
- Evaluation and Tracing: Outlines the evaluation setup with RAGAS and includes provisions for LangFuse tracing.
- API Endpoints: Uses FastAPI to serve the API and generate its documentation.
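Putting those pieces together, the indexing and querying flow looks roughly like this (a sketch assuming recent llama-index packages with the Qdrant vector-store integration; the collection name and docs path are illustrative):

```python
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Connect to a local Qdrant instance and back the index with it.
client = qdrant_client.QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="kakawa")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Load the documentation, embed it, and build a query engine over the index.
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
query_engine = index.as_query_engine(similarity_top_k=3)

print(query_engine.query("How to use Mapeo tracks feature?"))
```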
Dependencies for this project are managed by Poetry and are specified in the `pyproject.toml` file. Sensitive keys and configurations are stored in a `.env` file, which is excluded from version control via the `.gitignore` file.
In order to run the full stack locally you'll need Qdrant, Ollama and LangFuse installed and running. You can run them all using Docker and docker-compose; skip to the Docker section below for details.
To use the Qdrant, OpenAI and LangFuse cloud platforms, follow these steps:
- Sign up for an OpenAI account at https://openai.com
- After signing up, navigate to the API section to obtain your API key
- Update the `.env` file with your OpenAI API key by setting the `OPENAI_API_KEY` variable
- Sign up for a Qdrant cloud account at https://cloud.qdrant.io
- Obtain your API key and Qdrant URL
- Update the `.env` file with your Qdrant API key and URL
- Sign up for a LangFuse cloud account at https://cloud.langfuse.com
- Obtain your public and secret keys
- Update the `.env` file with your LangFuse public and secret keys, and set the `LANGFUSE_HOST` variable to the cloud URL
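The resulting `.env` might look like the sketch below; apart from `OPENAI_API_KEY` and `LANGFUSE_HOST`, the variable names are assumptions, so check `.env.example` for the exact names this project expects:

```
OPENAI_API_KEY=sk-...
QDRANT_URL=https://your-cluster.cloud.qdrant.io   # assumed variable name
QDRANT_API_KEY=...                                # assumed variable name
LANGFUSE_PUBLIC_KEY=pk-lf-...                     # assumed variable name
LANGFUSE_SECRET_KEY=sk-lf-...                     # assumed variable name
LANGFUSE_HOST=https://cloud.langfuse.com
```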
Before starting, make sure you've set up the stack locally or on the cloud, and added the keys and URLs to the `.env` file.
Make sure you have Python and Poetry installed. To run the API follow these steps:
- Run `cp .env.example .env` and ensure all the required environment variables are set.
- Install all dependencies using Poetry by running `poetry install`.
- Activate the virtual environment created by Poetry with `poetry shell`.
- Run the app with `uvicorn app.main:app`, adding the `--reload` flag for development.
This will start a server on port `8000`. You can test the API with curl:

```bash
curl -X 'GET' \
  'http://127.0.0.1:8000/rag?query=How%20to%20use%20Mapeo%20tracks%20feature%3F&user_id=test_user&session_id=test_session' \
  -H 'accept: application/json'
```
FastAPI automatically provides interactive Swagger documentation at `/docs`.
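For reference, a stripped-down `/rag` endpoint could look like the following (a hypothetical sketch, not the actual `app.main` implementation; a stub stands in for the LlamaIndex query engine built at startup):

```python
from fastapi import FastAPI

app = FastAPI(title="Kakawa RAG API")

# Stub so the sketch runs standalone; the real app would build a LlamaIndex
# query engine at startup, as in the indexing sketch above.
class _EchoEngine:
    def query(self, q: str) -> str:
        return f"(stub answer for: {q})"

query_engine = _EchoEngine()

@app.get("/rag")
async def rag(query: str, user_id: str, session_id: str):
    # user_id and session_id would be forwarded to LangFuse for tracing.
    response = query_engine.query(query)
    return {"response": str(response)}
```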
Alternatively, if you are using Docker:
- Build the Docker image with `docker build -t kakawa-rag-api .`
- Run the Docker container with `docker run --env-file .env -v ./docs:/usr/src/app/docs --network host kakawa-rag-api`
To run the full stack (Qdrant, Ollama and LangFuse) use docker-compose:
- Ensure all the required environment variables are set in the `.env` file.
- Start the services with `docker-compose up`, or bring up only the desired services, e.g. `docker-compose up langfuse-server langfuse-db`.
- Basic `/rag` endpoint with FastAPI
- Working RAG querying
- Integration with LangFuse
- Offline-first RAG
- Integrate a cloud-hosted vector-store
- Test stack with GitHub Actions
- Integrate with Google Drive
- Caching of vector-database and embeddings
- Evaluation data generation
- Evaluation
- Tuning of RAG: techniques, embeddings, top_k, chunk sizes...