Refactor Tutorial 04 to make it testable (deepset-ai#8)
* refactor Tutorial 04 to make it testable

* fix ES docstore creation

* update markdown version

* wait for ES to be ready
masci committed Sep 16, 2022
1 parent f84091f commit 356ef2d
Showing 4 changed files with 142 additions and 95 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/nightly.yml
@@ -21,9 +21,10 @@ jobs:
max-parallel: 2
matrix:
notebook:
# Note: use only the name of the file without the extension
# Note: use the name of the file without the extension
- 01_Basic_QA_Pipeline
- 03_Basic_QA_Pipeline_without_Elasticsearch
- 04_FAQ_style_QA

env:
ELASTICSEARCH_HOST: "elasticsearch"
4 changes: 3 additions & 1 deletion .github/workflows/run_tutorials.yml
@@ -49,5 +49,7 @@ jobs:
done
- name: Run the converted notebooks
# Note: the `+` at the end of the `find` invocation will make it fail if any
# of the execs failed, otherwise `find` returns 0 even when the execs fail.
run: |
find ./tutorials -name "*.py" -execdir python {} \;
find ./tutorials -name "*.py" -execdir python {} +;
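
The exit-status difference described in the workflow comment can be checked locally. The following is a minimal sketch, not part of the commit: it assumes a POSIX-compatible `find` is on the `PATH`, uses `-exec` instead of `-execdir` for simplicity, and the helper name `find_exit_codes` is hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def find_exit_codes():
    """Return find's exit status for the `;` and `+` forms of -exec.

    Hypothetical helper for illustration; it runs one deliberately
    failing script through both forms and reports both statuses.
    """
    with tempfile.TemporaryDirectory() as tmp:
        # A script that always exits with status 1
        Path(tmp, "failing.py").write_text("raise SystemExit(1)\n")

        base = ["find", tmp, "-name", "*.py", "-exec", sys.executable, "{}"]

        # `;` form: the failure only makes the -exec test false
        semicolon_status = subprocess.run(base + [";"], check=False).returncode

        # `+` form: a failing invocation is reflected in find's own status
        plus_status = subprocess.run(base + ["+"], check=False).returncode

    return semicolon_status, plus_status
```

Calling `find_exit_codes()` should return a zero status for the `;` form and a non-zero status for the `+` form. One caveat worth checking: the `+` form may pass several file names to a single `python` call, in which case Python runs only the first one as the script and treats the rest as arguments.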
75 changes: 44 additions & 31 deletions markdowns/4.md
@@ -33,20 +33,23 @@ Make sure you enable the GPU runtime to experience decent speed in this tutorial

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/colab_gpu_runtime.jpg">

You can double check whether the GPU runtime is enabled with the following command:

```python
# Make sure you have a GPU running
!nvidia-smi

```bash
%%bash

nvidia-smi
```

To start, install the latest release of Haystack with `pip`:

```python
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest main of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]
```bash
%%bash

pip install --upgrade pip
pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]
```

## Logging
@@ -64,14 +67,6 @@ logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logg
logging.getLogger("haystack").setLevel(logging.INFO)
```


```python
from haystack.document_stores import ElasticsearchDocumentStore

from haystack.nodes import EmbeddingRetriever
import pandas as pd
```

### Start an Elasticsearch server
You can start Elasticsearch on your local machine instance using Docker. If Docker is not readily available in your environment (eg., in Colab notebooks), then you can manually download and execute Elasticsearch from source.

@@ -83,36 +78,49 @@ from haystack.utils import launch_es
launch_es()
```

### Start an Elasticsearch server in Colab

```python
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2
If Docker is not readily available in your environment (e.g. in Colab notebooks), then you can manually download and execute Elasticsearch from source.

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1) # as daemon
)
# wait until ES has started
! sleep 30
```bash
%%bash

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2
sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch -d
```


```bash
%%bash --bg

sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch
```
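
The commit message mentions waiting for Elasticsearch to be ready, which the tutorial does with a fixed 30-second sleep. A polling loop is one possible alternative. This is a sketch only, not what the commit implements: the function name `wait_for_elasticsearch` and its parameters are hypothetical, and the URL assumes Elasticsearch listens on its default HTTP port, 9200.

```python
import time
import urllib.request


def wait_for_elasticsearch(url="http://localhost:9200", timeout=60.0, interval=1.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass.

    Hypothetical helper for illustration. Returns True if the server
    answered in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except OSError:
            # URLError and HTTPError are OSError subclasses; this covers
            # refused connections, timeouts and non-2xx responses.
            pass
        time.sleep(interval)
    return False
```

A notebook cell could then call `wait_for_elasticsearch(f"http://{host}:9200")` and stop early if it returns `False`, rather than always pausing for the full 30 seconds.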

### Init the DocumentStore
In contrast to Tutorial 1 (extractive QA), we:
In contrast to Tutorial 1 (Build your first QA system), we:

* specify the name of our `text_field` in Elasticsearch that we want to return as an answer
* specify the name of our `embedding_field` in Elasticsearch where we'll store the embedding of our question and that is used later for calculating our similarity to the incoming user question
* set `excluded_meta_data=["question_emb"]` so that we don't return the huge embedding vectors in our search results


```python
import os
import time

from haystack.document_stores import ElasticsearchDocumentStore

# Wait 30 seconds only to be sure Elasticsearch is ready before continuing
time.sleep(30)

# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")

document_store = ElasticsearchDocumentStore(
host="localhost",
host=host,
username="",
password="",
index="document",
@@ -129,6 +137,8 @@ We can use the `EmbeddingRetriever` for this purpose and specify a model that we


```python
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
document_store=document_store,
embedding_model="sentence-transformers/all-MiniLM-L6-v2",
@@ -143,8 +153,11 @@ Here: We download some question-answer pairs related to COVID-19


```python
import pandas as pd

from haystack.utils import fetch_archive_from_http


# Download
doc_dir = "data/tutorial4"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip"