{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# Generative QA with RAGenerator" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> As of version 1.16, `RAGenerator` has been deprecated in Haystack and completely removed from Haystack as of v1.18. We recommend following the tutorial on [Creating a Generative QA Pipeline with Retrieval-Augmentation](https://haystack.deepset.ai/tutorials/22_pipeline_with_promptnode) instead. For more details about this deprecation, check out [our announcement](https://github.com/deepset-ai/haystack/discussions/4816) on Github." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "While extractive QA highlights the span of text that answers a query,\n", "generative QA can return a novel text answer that it has composed.\n", "In this tutorial, you will learn how to set up a generative system using the\n", "[RAG model](https://arxiv.org/abs/2005.11401) which conditions the\n", "answer generator on a set of retrieved documents." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "\n", "## Preparing the Colab Environment\n", "\n", "- [Enable GPU Runtime](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Installing Haystack\n", "\n", "To start, let's install the latest release of Haystack with `pip`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "%%bash\n", "\n", "pip install --upgrade pip\n", "pip install farm-haystack[colab,faiss]==1.17.2" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Enabling Telemetry \n", "Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product but you can always opt out by commenting the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry) for more details." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from haystack.telemetry import tutorial_running\n", "\n", "tutorial_running(7)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "## Logging\n", "\n", "We configure how logging messages should be displayed and which log level should be used before importing Haystack.\n", "Example log message:\n", "INFO - haystack.utils.preprocessing - Converting data/tutorial1/218_Olenna_Tyrell.txt\n", "Default log level in basicConfig is WARNING so the explicit parameter is not necessary but can be changed easily:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import logging\n", "\n", "logging.basicConfig(format=\"%(levelname)s - %(name)s - %(message)s\", level=logging.WARNING)\n", "logging.getLogger(\"haystack\").setLevel(logging.INFO)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Fetching and Cleaning Documents" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Let's download a csv containing some sample text and preprocess the data.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import pandas as pd\n", "from haystack.utils import fetch_archive_from_http\n", "\n", "# Download sample\n", "doc_dir = \"data/tutorial7/\"\n", "s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/small_generator_dataset.csv.zip\"\n", "fetch_archive_from_http(url=s3_url, output_dir=doc_dir)\n", "\n", "# Create dataframe with columns \"title\" and \"text\"\n", "df = pd.read_csv(f\"{doc_dir}/small_generator_dataset.csv\", sep=\",\")\n", "# Minimal cleaning\n", "df.fillna(value=\"\", inplace=True)\n", "\n", "print(df.head())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "We can cast our data into Haystack Document objects.\n", "Alternatively, we can also just use dictionaries with \"text\" and \"meta\" fields" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from haystack import Document\n", "\n", "# Use data to initialize Document objects\n", "titles = list(df[\"title\"].values)\n", "texts = list(df[\"text\"].values)\n", "documents = []\n", "for title, text in zip(titles, texts):\n", " documents.append(Document(content=text, meta={\"name\": title or \"\"}))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Initializing the DocumentStore\n", "\n", "Here we initialize the FAISSDocumentStore. Set `return_embedding` to `True`, so Generator doesn't have to perform re-embedding" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from haystack.document_stores import FAISSDocumentStore\n", "\n", "document_store = FAISSDocumentStore(faiss_index_factory_str=\"Flat\", return_embedding=True)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Initializing the Retriever\n", "\n", "We initialize DensePassageRetriever to encode documents, encode question and query documents." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from haystack.nodes import RAGenerator, DensePassageRetriever\n", "\n", "retriever = DensePassageRetriever(\n", " document_store=document_store,\n", " query_embedding_model=\"facebook/dpr-question_encoder-single-nq-base\",\n", " passage_embedding_model=\"facebook/dpr-ctx_encoder-single-nq-base\",\n", " use_gpu=True,\n", " embed_title=True,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Initializing the Generator\n", "\n", "We initialize RAGenerator to generate answers from retrieved Documents." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "generator = RAGenerator(\n", " model_name_or_path=\"facebook/rag-token-nq\",\n", " use_gpu=True,\n", " top_k=1,\n", " max_length=200,\n", " min_length=2,\n", " embed_title=True,\n", " num_beams=2,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Writing Documents\n", "\n", "We write documents to the DocumentStore, first by deleting any remaining documents then calling `write_documents()`.\n", "The `update_embeddings()` method uses the retriever to create an embedding for each document.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Delete existing documents in documents store\n", "document_store.delete_documents()\n", "\n", "# Write documents to document store\n", "document_store.write_documents(documents)\n", "\n", "# Add documents embeddings to index\n", "document_store.update_embeddings(retriever=retriever)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Initializing the Pipeline\n", "\n", "With a Haystack `Pipeline` you can stick together your building blocks to a search pipeline.\n", "Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.\n", "To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.\n", "You can learn more about `Pipelines` in the [docs](https://docs.haystack.deepset.ai/docs/pipelines)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from haystack.pipelines import GenerativeQAPipeline\n", "\n", "pipe = GenerativeQAPipeline(generator=generator, retriever=retriever)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Asking a Question\n", "\n", "Now let's ask questions to our system!\n", "The Retriever will pick out a small subset of documents that it finds relevant.\n", "These are used to condition the Generator as it generates the answer.\n", "What it should return then are novel text spans that form and answer to your question!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from haystack.utils import print_answers\n", "\n", "QUESTIONS = [\n", " \"who got the first nobel prize in physics\",\n", " \"when is the next deadpool movie being released\",\n", " \"which mode is used for short wave broadcast service\",\n", " \"who is the owner of reading football club\",\n", " \"when is the next scandal episode coming out\",\n", " \"when is the last time the philadelphia won the superbowl\",\n", " \"what is the most current adobe flash player version\",\n", " \"how many episodes are there in dragon ball z\",\n", " \"what is the first step in the evolution of the eye\",\n", " \"where is gall bladder situated in human body\",\n", " \"what is the main mineral in lithium batteries\",\n", " \"who is the president of usa right now\",\n", " \"where do the greasers live in the outsiders\",\n", " \"panda is a national animal of which country\",\n", " \"what is the name of manchester united stadium\",\n", "]\n", "\n", "for question in QUESTIONS:\n", " res = pipe.run(query=question, params={\"Generator\": {\"top_k\": 1}, \"Retriever\": {\"top_k\": 5}})\n", " print_answers(res, details=\"minimum\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.6 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "3.9.6" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 2 }