make tutorial 14 testable (deepset-ai#19)
* make tutorial 14 testable

* comment out draw

* minor

* enable on nightly
masci committed Dec 16, 2022
1 parent 210cdb9 commit 9b71be9
Showing 3 changed files with 196 additions and 80 deletions.
1 change: 1 addition & 0 deletions .github/workflows/nightly.yml
@@ -33,6 +33,7 @@ jobs:
- 10_Knowledge_Graph
- 11_Pipelines
- 12_LFQA
- 14_Query_Classifier
- 15_TableQA
- 16_Document_Classifier_at_Index_Time
- 17_Audio
114 changes: 76 additions & 38 deletions markdowns/14_Query_Classifier.md
@@ -3,7 +3,7 @@
layout: tutorial
colab: https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/14_Query_Classifier.ipynb
toc: True
title: "Query Classifier"
last_updated: 2022-11-24
last_updated: 2022-12-15
level: "intermediate"
weight: 80
description: Classify incoming queries so that they can be routed to the nodes that are best at handling them.
@@ -44,20 +44,27 @@
Make sure you enable the GPU runtime to experience decent speed in this tutorial.

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/colab_gpu_runtime.jpg">

You can double-check whether the GPU runtime is enabled with the following command:


```bash
%%bash

nvidia-smi
```

Next we make sure the latest version of Haystack is installed:


```bash
%%bash

# Install the latest main of Haystack (Colab)
pip install --upgrade pip
pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]

# Install these to allow pipeline visualization
apt install libgraphviz-dev
pip install pygraphviz
```

### Logging
@@ -156,38 +163,43 @@
And as we see, the question "Who was the father of Arya Stark" is sent to branch 1
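To make the routing concrete, here is a toy heuristic (a hypothetical stand-in, not the trained sklearn model the tutorial uses): queries that look like questions or statements go to `output_1`, bare keyword queries go to `output_2`:

```python
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which", "is", "are", "does", "did"}

def toy_classify(query: str) -> str:
    """Toy stand-in for a query classifier: route question-like queries to
    output_1 and bare keyword queries to output_2."""
    tokens = query.lower().rstrip("?").split()
    if query.endswith("?") or (tokens and tokens[0] in QUESTION_WORDS):
        return "output_1"  # question/statement branch
    return "output_2"      # keyword branch

print(toy_classify("Who was the father of Arya Stark"))  # output_1
print(toy_classify("arya stark father"))  # output_2
```

The real classifiers make this decision with a trained model rather than word lists, but the branch semantics are the same.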

Now let's see how we can use query classifiers in a question-answering (QA) pipeline. First, we launch Elasticsearch:

#### Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not available in your environment (e.g., in Colab notebooks), you can manually download and run Elasticsearch from source, as shown in the next section.


```python
# Recommended: Start Elasticsearch using Docker via the Haystack utility function
from haystack.utils import launch_es

launch_es()
```
#### Start an Elasticsearch server in Colab

If Docker is not readily available in your environment (e.g. in Colab notebooks), then you can manually download and execute Elasticsearch from source.


```bash
%%bash

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2
```


```bash
%%bash --bg

sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch
```
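Elasticsearch takes a while to boot, which is why a later cell sleeps for 30 seconds. As a sketch of a more robust alternative (a hypothetical helper using only the standard library), you could poll the port until it accepts connections:

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 120.0) -> bool:
    """Poll a TCP port until it accepts connections; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(1)
    return False

# Example: wait_for_port("localhost", 9200) before creating the DocumentStore
```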

Next we fetch some data&mdash;for our example we'll use pages from the Game of Thrones wiki&mdash;and index it in our `DocumentStore`:


```python
from haystack.utils import fetch_archive_from_http, convert_files_to_docs, clean_wiki_text

# Download and prepare data - 517 Wikipedia articles for Game of Thrones
doc_dir = "data/tutorial14"
@@ -196,9 +208,23 @@
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Convert files to Documents that can be indexed into our DocumentStore
got_docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
```
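Conceptually, `split_paragraphs=True` breaks each file into paragraph-sized documents. A toy sketch of the idea (the real `convert_files_to_docs` also cleans the text and attaches metadata):

```python
def split_paragraphs(text: str) -> list:
    """Toy sketch: split raw text into paragraph-sized document contents."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# Each paragraph becomes one document's content
docs = [{"content": p} for p in split_paragraphs("Arya Stark is a character.\n\nNed Stark was her father.")]
print(len(docs))  # 2
```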


```python
import os
import time

from haystack.document_stores import ElasticsearchDocumentStore


# Wait 30 seconds to make sure Elasticsearch is ready before continuing
time.sleep(30)

# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")


# Initialize DocumentStore and index documents
document_store = ElasticsearchDocumentStore(host=host)
document_store.delete_documents()
document_store.write_documents(got_docs)
```

@@ -212,6 +238,9 @@
We start by initializing our retrievers and reader:


```python
from haystack.nodes import BM25Retriever, EmbeddingRetriever, FARMReader


# Initialize sparse retriever for keyword queries
bm25_retriever = BM25Retriever(document_store=document_store)

```

@@ -228,6 +257,9 @@
Now we define our pipeline. As promised, the question/statement branch `output_1`


```python
from haystack.pipelines import Pipeline


# Here we build the pipeline
sklearn_keyword_classifier = Pipeline()
sklearn_keyword_classifier.add_node(component=SklearnQueryClassifier(), name="QueryClassifier", inputs=["Query"])
@@ -237,14 +269,17 @@
sklearn_keyword_classifier.add_node(
    component=embedding_retriever, name="EmbeddingRetriever", inputs=["QueryClassifier.output_1"]
)
sklearn_keyword_classifier.add_node(component=bm25_retriever, name="BM25Retriever", inputs=["QueryClassifier.output_2"])
sklearn_keyword_classifier.add_node(component=reader, name="QAReader", inputs=["BM25Retriever", "EmbeddingRetriever"])

# To generate a visualization of the pipeline, uncomment the following:
# sklearn_keyword_classifier.draw("sklearn_keyword_classifier.png")
```

Below, we can see how this choice affects the branching structure: the keyword query "arya stark father" and the question query "Who is the father of Arya Stark?" produce noticeably different results, likely because different retrievers handle keyword vs. question/statement queries.


```python
from haystack.utils import print_answers


# Useful for framing headers
equal_line = "=" * 30

```

@@ -320,14 +355,17 @@

```python
transformer_question_classifier.add_node(
    component=TransformersQueryClassifier(model_name_or_path="shahrukhx01/question-vs-statement-classifier"),
    name="QueryClassifier",
    inputs=["EmbeddingRetriever"],
)
transformer_question_classifier.add_node(component=reader, name="QAReader", inputs=["QueryClassifier.output_1"])

# To generate a visualization of the pipeline, uncomment the following:
# transformer_question_classifier.draw("transformer_question_classifier.png")
```

And here are the results of this pipeline: with a question query like "Who is the father of Arya Stark?", we obtain answers from a reader, and with a statement query like "Arya Stark was the daughter of a Lord", we just obtain documents from a retriever.
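The two result shapes can be told apart programmatically. A small sketch (a hypothetical helper, assuming the standard Haystack result dict with `answers`/`documents` keys):

```python
def summarize_result(result: dict) -> str:
    """Report whether a pipeline result carries reader answers or only documents."""
    if result.get("answers"):
        return f"{len(result['answers'])} answers from the reader"
    return f"{len(result.get('documents', []))} documents from the retriever"

print(summarize_result({"answers": [{"answer": "Ned Stark"}]}))  # 1 answers from the reader
print(summarize_result({"documents": ["doc1", "doc2"]}))  # 2 documents from the retriever
```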


```python
from haystack.utils import print_documents


# Useful for framing headers
equal_line = "=" * 30

```

@@ -407,7 +445,7 @@

```python
pd.DataFrame.from_dict(sent_results)
```
You can also perform zero-shot classification by providing a suitable base transformer model and choosing the classes the model should predict.
For example, we may be interested in whether the user query is related to music or cinema.

In this case, the `labels` parameter is a list containing the candidate classes.
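Each candidate label corresponds to one output edge of the classifier node. A sketch of the mapping, assuming edges follow the order of the `labels` list (first label feeds `output_1`, and so on):

```python
def branch_for_label(label: str, labels: list) -> str:
    """Map a predicted zero-shot label to the classifier's output edge,
    assuming edges follow the order of the candidate labels."""
    return f"output_{labels.index(label) + 1}"

labels = ["music", "cinema"]
print(branch_for_label("cinema", labels))  # output_2
```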


