Created using Colaboratory

deepset-ai · TuanaCelik · Jan 2, 2024 · Jan 2, 2024 · Jan 2, 2024 · Jan 2, 2024
commit 38760f0629563d74231917ec3587331fb723b2cf
diff --git a/chroma-indexing-and-rag-examples.ipynb b/chroma-indexing-and-rag-examples.ipynb
@@ -0,0 +1,216 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Use ChromaDocumentStore with Haystack\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "ZjlwUPWugM37"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ ">[Use ChromaDocumentStore with Haystack](#scrollTo=ZjlwUPWugM37)\n",
+ "\n",
+ ">>[Install dependencies](#scrollTo=135w48jbgRRU)\n",
+ "\n",
+ ">>[Indexing Pipeline: preprocess, split and index documents](#scrollTo=gt_XhGXBgU-I)\n",
+ "\n",
+ ">>[Query Pipeline: build retrieval-augmented generation (RAG) pipelines](#scrollTo=44cRT55agw2e)\n",
+ "\n"
+ ],
+ "metadata": {
+ "colab_type": "toc",
+ "id": "TjEesvJKiYKT"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Install dependencies"
+ ],
+ "metadata": {
+ "id": "135w48jbgRRU"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "znSRD-hO2doM"
+ },
+ "outputs": [],
+ "source": [
+ "# Install the Chroma integration, Haystack will come as a dependency\n",
+ "!pip install -U chroma-haystack"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Indexing Pipeline: preprocess, split and index documents\n",
+ "In this section, we will index documents into a Chroma DB collection by building a Haystack indexing pipeline. Here, we are indexing documents from the [VIM User Manuel](https://vimhelp.org/) into the Haystack `ChromaDocumentStore`.\n",
+ "\n",
+ " We have the `.txt` files for these pages in the examples folder for the `ChromaDocumentStore`, so we are using the [`TextFileToDocument`](https://docs.haystack.deepset.ai/v2.0/docs/textfiletodocument) and [`DocumentWriter`](https://docs.haystack.deepset.ai/v2.0/docs/documentwriter) components to build this indexing pipeline."
+ ],
+ "metadata": {
+ "id": "gt_XhGXBgU-I"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Fetch data files from the Github repo\n",
+ "!curl -sL https://github.com/deepset-ai/haystack-core-integrations/tarball/main -o main.tar\n",
+ "!mkdir main\n",
+ "!tar xf main.tar -C main --strip-components 1\n",
+ "!mv main/integrations/chroma/example/data ."
+ ],
+ "metadata": {
+ "id": "fGxsA9C74BWr"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import os\n",
+ "from pathlib import Path\n",
+ "\n",
+ "from haystack import Pipeline\n",
+ "from haystack.components.converters import TextFileToDocument\n",
+ "from haystack.components.writers import DocumentWriter\n",
+ "\n",
+ "from chroma_haystack import ChromaDocumentStore\n",
+ "\n",
+ "file_paths = [\"data\" / Path(name) for name in os.listdir(\"data\")]\n",
+ "\n",
+ "# Chroma is used in-memory so we use the same instances in the two pipelines below\n",
+ "document_store = ChromaDocumentStore()\n",
+ "\n",
+ "indexing = Pipeline()\n",
+ "indexing.add_component(\"converter\", TextFileToDocument())\n",
+ "indexing.add_component(\"writer\", DocumentWriter(document_store))\n",
+ "indexing.connect(\"converter\", \"writer\")\n",
+ "indexing.run({\"converter\": {\"sources\": file_paths}})\n"
+ ],
+ "metadata": {
+ "id": "ayyBKQIC3jGo"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Query Pipeline: build retrieval-augmented generation (RAG) pipelines\n",
+ "\n",
+ "Once we have documents in the `ChromaDocumentStore`, we can use the accompanying Chroma retrievers to build a query pipeline. The query pipeline below is a simple retrieval-augmented generation (RAG) pipeline that uses Chroma's [query API](https://docs.trychroma.com/usage-guide#querying-a-collection).\n",
+ "\n",
+ "You can change the idnexing pipeline and query pipelines here for embedding search by using one of the [`Haystack Embedders`](https://docs.haystack.deepset.ai/v2.0/docs/embedders) accompanied by the `ChromaEmbeddingRetriever`.\n",
+ "\n",
+ "\n",
+ "In this example we are using:\n",
+ "- The `HuggingFaceTGIGenerator` with the Mistral 8x7B model. (You will need a Hugging Face token to use this model). You can repleace this with any of the other [`Generators`](https://docs.haystack.deepset.ai/v2.0/docs/generators)\n",
+ "- The `PromptBuilder` which holds the prompt template. You can adjust this to a prompt of your choice\n",
+ "- The `ChromaQueryRetriver` which expects a list of queries and retieves the `top_k` most relevant documents from your Chroma collection."
+ ],
+ "metadata": {
+ "id": "44cRT55agw2e"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from getpass import getpass\n",
+ "\n",
+ "hf_token = getpass(\"Enter Hugging Face API key:\")"
+ ],
+ "metadata": {
+ "id": "WGGApIR3pllW"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from chroma_haystack.retriever import ChromaQueryRetriever\n",
+ "from haystack.components.generators import HuggingFaceTGIGenerator\n",
+ "from haystack.components.builders import PromptBuilder\n",
+ "\n",
+ "prompt = \"\"\"\n",
+ "Answer the query based on the provided context.\n",
+ "If the context does not contain the answer, say 'Answer not found'.\n",
+ "Context:\n",
+ "{% for doc in documents %}\n",
+ " {{ doc.content }}\n",
+ "{% endfor %}\n",
+ "query: {{query}}\n",
+ "Answer:\n",
+ "\"\"\"\n",
+ "prompt_builder = PromptBuilder(template=prompt)\n",
+ "\n",
+ "llm = HuggingFaceTGIGenerator(model=\"mistralai/Mixtral-8x7B-Instruct-v0.1\", token=hf_token)\n",
+ "llm.warm_up()\n",
+ "retriever = ChromaQueryRetriever(document_store)\n",
+ "\n",
+ "querying = Pipeline()\n",
+ "querying.add_component(\"retriever\", retriever)\n",
+ "querying.add_component(\"prompt_builder\", prompt_builder)\n",
+ "querying.add_component(\"llm\", llm)\n",
+ "\n",
+ "querying.connect(\"retriever.documents\", \"prompt_builder.documents\")\n",
+ "querying.connect(\"prompt_builder\", \"llm\")"
+ ],
+ "metadata": {
+ "id": "YQJTPYNreNV-"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "query = \"Should I write documentation for my plugin?\"\n",
+ "results = querying.run({\"retriever\": {\"queries\": [query], \"top_k\": 3},\n",
+ " \"prompt_builder\": {\"query\": query},\n",
+ " \"llm\":{\"generation_kwargs\": {\"max_new_tokens\": 350}}})"
+ ],
+ "metadata": {
+ "id": "O8jcmcdqrGu1"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "print(results[\"llm\"][\"replies\"][0])"
+ ],
+ "metadata": {
+ "id": "Pa7f7EzjtBXw"
+ },
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+}