refactor Tutorial 05 to be testable (deepset-ai#9)
* refactor Tutorial 05 to be testable

* pass host to all the ds constructors
masci committed Sep 16, 2022
1 parent 356ef2d commit 78aceb0
Showing 3 changed files with 173 additions and 582 deletions.
1 change: 1 addition & 0 deletions .github/workflows/nightly.yml
@@ -25,6 +25,7 @@ jobs:
- 01_Basic_QA_Pipeline
- 03_Basic_QA_Pipeline_without_Elasticsearch
- 04_FAQ_style_QA
- 05_Evaluation

env:
ELASTICSEARCH_HOST: "elasticsearch"
85 changes: 53 additions & 32 deletions markdowns/5.md
@@ -22,20 +22,23 @@ Make sure you enable the GPU runtime to experience decent speed in this tutorial

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/colab_gpu_runtime.jpg">

You can double-check whether the GPU runtime is enabled with the following command:

```bash
%%bash

nvidia-smi
```

To start, install the latest main of Haystack with `pip`:

```bash
%%bash

pip install --upgrade pip
pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]
```
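
To confirm the installation succeeded, a quick sanity check (assuming the package exposes `__version__`, as recent farm-haystack releases do):

```python
# Optional sanity check: print the installed Haystack version
import haystack

print(haystack.__version__)
```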

## Logging
@@ -54,27 +57,42 @@ logging.getLogger("haystack").setLevel(logging.INFO)
```

## Start an Elasticsearch server

You can start Elasticsearch on your local machine instance using Docker:


```python
# Recommended: Start Elasticsearch using Docker via the Haystack utility function
from haystack.utils import launch_es

launch_es()
```

If Docker is not readily available in your environment (e.g., in Colab notebooks), then you can manually download and execute Elasticsearch from source:

```bash
%%bash

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2
```


```bash
%%bash --bg

sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch
```

Wait 30 seconds to be sure Elasticsearch is ready before continuing:


```python
import time

time.sleep(30)
```
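
A fixed sleep works, but you can continue as soon as the server actually responds. A minimal polling sketch, assuming Elasticsearch listens on the default `localhost:9200` and the `requests` package is available:

```python
import time

import requests

# Poll Elasticsearch for up to ~30 seconds instead of sleeping blindly
for _ in range(30):
    try:
        if requests.get("http://localhost:9200").status_code == 200:
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(1)
```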

## Fetch, Store And Preprocess the Evaluation Dataset
@@ -83,6 +101,7 @@
```python
from haystack.utils import fetch_archive_from_http


# Download evaluation data, which is a subset of the Natural Questions development set containing 50 documents with one question per document and multiple annotated answers
doc_dir = "data/tutorial5"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/nq_dev_subset_v2.json.zip"
@@ -91,19 +110,21 @@ fetch_archive_from_http(url=s3_url, output_dir=doc_dir)


```python
import os

from haystack.document_stores import ElasticsearchDocumentStore


# Make sure these indices do not collide with existing ones; they will be wiped clean before data is inserted
doc_index = "tutorial5_docs"
label_index = "tutorial5_labels"
```
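
If you want to check up front that these names are free, here is a hypothetical pre-flight check using the low-level `elasticsearch` client that Haystack depends on (the exact client API may vary with your installed version):

```python
from elasticsearch import Elasticsearch

# Hypothetical pre-flight check: warn if the indices already exist,
# since they will be wiped before the evaluation data is inserted
es = Elasticsearch(hosts=[os.environ.get("ELASTICSEARCH_HOST", "localhost")])
for index in (doc_index, label_index):
    if es.indices.exists(index=index):
        print(f"Warning: index '{index}' already exists and will be recreated")
```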


```python
# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")

# Connect to Elasticsearch
document_store = ElasticsearchDocumentStore(
host="localhost",
host=host,
username="",
password="",
index=doc_index,
@@ -490,7 +511,7 @@ EXPERIMENT_NAME = "haystack-tutorial-5"


```python
document_store = ElasticsearchDocumentStore(host=host, index="sparse_index", recreate_index=True)
preprocessor = PreProcessor(
split_length=200,
split_overlap=0,
@@ -524,7 +545,7 @@ sparse_eval_result = Pipeline.execute_eval_run(


```python
document_store = ElasticsearchDocumentStore(host=host, index="dense_index", recreate_index=True)
emb_retriever = EmbeddingRetriever(
document_store=document_store,
model_format="sentence_transformers",
