Merge pull request #4 from TuanaCelik/main

Add chroma example
TuanaCelik committed Jan 2, 2024
2 parents f1d5186 + 1843b4c commit e017349
Showing 2 changed files with 217 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -6,6 +6,7 @@ A collection of example notebooks using Haystack 👇
| Use Gemini Models with Vertex AI| <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/vertexai-gemini-examples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>|
| Gradient AI Embedders and Generators for RAG | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/gradient-embeders-and-generators-for-notion-rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>|
| Hacker News RAG with Custom Component | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/hackernews-custom-component-rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>|
| Use Chroma for RAG and Indexing | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/chroma-indexing-and-rag-examples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>|
| Cohere for Multilingual QA (Haystack 1.x)| <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/haystack-1.x/cohere-for-multilingual-qa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>|
| GPT-4 and Weaviate for Custom Documentation QA (Haystack 1.x)| <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/haystack-1.x/gpt4-weaviate-custom-documentation-qa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>|
| Whisper Transcriber and Weaviate for YouTube video QA (Haystack 1.x)| <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/haystack-1.x/whisper-and-weaviate-for-youtube-rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>|
216 changes: 216 additions & 0 deletions chroma-indexing-and-rag-examples.ipynb
@@ -0,0 +1,216 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Use ChromaDocumentStore with Haystack\n",
"\n"
],
"metadata": {
"id": "ZjlwUPWugM37"
}
},
{
"cell_type": "markdown",
"source": [
">[Use ChromaDocumentStore with Haystack](#scrollTo=ZjlwUPWugM37)\n",
"\n",
">>[Install dependencies](#scrollTo=135w48jbgRRU)\n",
"\n",
">>[Indexing Pipeline: preprocess, split and index documents](#scrollTo=gt_XhGXBgU-I)\n",
"\n",
">>[Query Pipeline: build retrieval-augmented generation (RAG) pipelines](#scrollTo=44cRT55agw2e)\n",
"\n"
],
"metadata": {
"colab_type": "toc",
"id": "TjEesvJKiYKT"
}
},
{
"cell_type": "markdown",
"source": [
"## Install dependencies"
],
"metadata": {
"id": "135w48jbgRRU"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "znSRD-hO2doM"
},
"outputs": [],
"source": [
"# Install the Chroma integration; Haystack is installed along with it as a dependency\n",
"!pip install -U chroma-haystack"
]
},
{
"cell_type": "markdown",
"source": [
"## Indexing Pipeline: preprocess, split and index documents\n",
"In this section, we will index documents into a Chroma collection by building a Haystack indexing pipeline. Here, we are indexing documents from the [VIM User Manual](https://vimhelp.org/) into the Haystack `ChromaDocumentStore`.\n",
"\n",
"We have the `.txt` files for these pages in the examples folder for the `ChromaDocumentStore`, so we are using the [`TextFileToDocument`](https://docs.haystack.deepset.ai/v2.0/docs/textfiletodocument) and [`DocumentWriter`](https://docs.haystack.deepset.ai/v2.0/docs/documentwriter) components to build this indexing pipeline."
],
"metadata": {
"id": "gt_XhGXBgU-I"
}
},
{
"cell_type": "code",
"source": [
"# Fetch data files from the GitHub repo\n",
"!curl -sL https://github.com/deepset-ai/haystack-core-integrations/tarball/main -o main.tar\n",
"!mkdir main\n",
"!tar xf main.tar -C main --strip-components 1\n",
"!mv main/integrations/chroma/example/data ."
],
"metadata": {
"id": "fGxsA9C74BWr"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"import os\n",
"from pathlib import Path\n",
"\n",
"from haystack import Pipeline\n",
"from haystack.components.converters import TextFileToDocument\n",
"from haystack.components.writers import DocumentWriter\n",
"\n",
"from chroma_haystack import ChromaDocumentStore\n",
"\n",
"file_paths = [\"data\" / Path(name) for name in os.listdir(\"data\")]\n",
"\n",
"# Chroma is used in-memory so we use the same instances in the two pipelines below\n",
"document_store = ChromaDocumentStore()\n",
"\n",
"indexing = Pipeline()\n",
"indexing.add_component(\"converter\", TextFileToDocument())\n",
"indexing.add_component(\"writer\", DocumentWriter(document_store))\n",
"indexing.connect(\"converter\", \"writer\")\n",
"indexing.run({\"converter\": {\"sources\": file_paths}})\n"
],
"metadata": {
"id": "ayyBKQIC3jGo"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Query Pipeline: build retrieval-augmented generation (RAG) pipelines\n",
"\n",
"Once we have documents in the `ChromaDocumentStore`, we can use the accompanying Chroma retrievers to build a query pipeline. The query pipeline below is a simple retrieval-augmented generation (RAG) pipeline that uses Chroma's [query API](https://docs.trychroma.com/usage-guide#querying-a-collection).\n",
"\n",
"You can change the indexing and query pipelines here to use embedding search by combining one of the [`Haystack Embedders`](https://docs.haystack.deepset.ai/v2.0/docs/embedders) with the `ChromaEmbeddingRetriever`.\n",
"\n",
"\n",
"In this example we are using:\n",
"- The `HuggingFaceTGIGenerator` with the Mixtral 8x7B model (you will need a Hugging Face token to use this model). You can replace this with any of the other [`Generators`](https://docs.haystack.deepset.ai/v2.0/docs/generators).\n",
"- The `PromptBuilder`, which holds the prompt template. You can adjust this to a prompt of your choice.\n",
"- The `ChromaQueryRetriever`, which expects a list of queries and retrieves the `top_k` most relevant documents from your Chroma collection."
],
"metadata": {
"id": "44cRT55agw2e"
}
},
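{
"cell_type": "markdown",
"source": [
"As a rough sketch of the embedding-based variant mentioned above (the `SentenceTransformersTextEmbedder` component, the `ChromaEmbeddingRetriever` import path, and the `query_embedding` connection name are assumptions here; check the current Haystack and Chroma integration docs), a query pipeline could look like the next cell:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch only: import paths and connection names are assumptions, verify against the docs\n",
"from haystack import Pipeline\n",
"from haystack.components.embedders import SentenceTransformersTextEmbedder\n",
"\n",
"from chroma_haystack import ChromaDocumentStore\n",
"from chroma_haystack.retriever import ChromaEmbeddingRetriever\n",
"\n",
"document_store = ChromaDocumentStore()\n",
"\n",
"embedding_query = Pipeline()\n",
"embedding_query.add_component(\"embedder\", SentenceTransformersTextEmbedder())\n",
"embedding_query.add_component(\"retriever\", ChromaEmbeddingRetriever(document_store))\n",
"# The embedder turns the query string into a vector used for similarity search in Chroma\n",
"embedding_query.connect(\"embedder.embedding\", \"retriever.query_embedding\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},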
{
"cell_type": "code",
"source": [
"from getpass import getpass\n",
"\n",
"hf_token = getpass(\"Enter Hugging Face API key:\")"
],
"metadata": {
"id": "WGGApIR3pllW"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"from chroma_haystack.retriever import ChromaQueryRetriever\n",
"from haystack.components.generators import HuggingFaceTGIGenerator\n",
"from haystack.components.builders import PromptBuilder\n",
"\n",
"prompt = \"\"\"\n",
"Answer the query based on the provided context.\n",
"If the context does not contain the answer, say 'Answer not found'.\n",
"Context:\n",
"{% for doc in documents %}\n",
" {{ doc.content }}\n",
"{% endfor %}\n",
"query: {{query}}\n",
"Answer:\n",
"\"\"\"\n",
"prompt_builder = PromptBuilder(template=prompt)\n",
"\n",
"llm = HuggingFaceTGIGenerator(model=\"mistralai/Mixtral-8x7B-Instruct-v0.1\", token=hf_token)\n",
"llm.warm_up()\n",
"retriever = ChromaQueryRetriever(document_store)\n",
"\n",
"querying = Pipeline()\n",
"querying.add_component(\"retriever\", retriever)\n",
"querying.add_component(\"prompt_builder\", prompt_builder)\n",
"querying.add_component(\"llm\", llm)\n",
"\n",
"querying.connect(\"retriever.documents\", \"prompt_builder.documents\")\n",
"querying.connect(\"prompt_builder\", \"llm\")"
],
"metadata": {
"id": "YQJTPYNreNV-"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"query = \"Should I write documentation for my plugin?\"\n",
"results = querying.run({\"retriever\": {\"queries\": [query], \"top_k\": 3},\n",
" \"prompt_builder\": {\"query\": query},\n",
" \"llm\":{\"generation_kwargs\": {\"max_new_tokens\": 350}}})"
],
"metadata": {
"id": "O8jcmcdqrGu1"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"print(results[\"llm\"][\"replies\"][0])"
],
"metadata": {
"id": "Pa7f7EzjtBXw"
},
"execution_count": null,
"outputs": []
}
]
}
