refactor Tutorial 05 to be testable (deepset-ai#9)
* refactor Tutorial 05 to be testable

* pass host to all the ds constructors
masci committed Sep 16, 2022
1 parent 356ef2d commit 78aceb0
Showing 3 changed files with 173 additions and 582 deletions.
1 change: 1 addition & 0 deletions .github/workflows/nightly.yml
@@ -25,6 +25,7 @@ jobs:
- 01_Basic_QA_Pipeline
- 03_Basic_QA_Pipeline_without_Elasticsearch
- 04_FAQ_style_QA
- 05_Evaluation

env:
ELASTICSEARCH_HOST: "elasticsearch"
85 changes: 53 additions & 32 deletions markdowns/5.md
@@ -22,20 +22,23 @@ Make sure you enable the GPU runtime to experience decent speed in this tutorial

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/colab_gpu_runtime.jpg">

You can double-check whether the GPU runtime is enabled with the following command:

```bash
%%bash

nvidia-smi
```

To start, install the latest main of Haystack with `pip`:

```bash
%%bash

pip install --upgrade pip
pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]
```
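
To confirm the installation succeeded, a quick sanity check (assuming the package exposes `__version__`, as recent farm-haystack releases do):

```python
# Optional sanity check: print the installed Haystack version
import haystack

print(haystack.__version__)
```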

## Logging
@@ -54,27 +57,42 @@ logging.getLogger("haystack").setLevel(logging.INFO)
```

## Start an Elasticsearch server

You can start Elasticsearch on your local machine instance using Docker:


```python
# Recommended: Start Elasticsearch using Docker via the Haystack utility function
from haystack.utils import launch_es

launch_es()
```

If Docker is not readily available in your environment (e.g., in Colab notebooks), then you can manually download and execute Elasticsearch from source:

```bash
%%bash

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2
```


```bash
%%bash --bg

sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch
```

Wait 30 seconds to be sure Elasticsearch is ready before continuing:


```python
import time

time.sleep(30)
```
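
A fixed sleep works, but you can continue as soon as the server actually responds. A minimal polling sketch, assuming Elasticsearch listens on the default `localhost:9200` and the `requests` package is available:

```python
import time

import requests

# Poll Elasticsearch for up to ~30 seconds instead of sleeping blindly
for _ in range(30):
    try:
        if requests.get("http://localhost:9200").status_code == 200:
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(1)
```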

## Fetch, Store And Preprocess the Evaluation Dataset
@@ -83,6 +101,7 @@
```python
from haystack.utils import fetch_archive_from_http


# Download evaluation data, which is a subset of the Natural Questions development set containing 50 documents with one question per document and multiple annotated answers
doc_dir = "data/tutorial5"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/nq_dev_subset_v2.json.zip"
@@ -91,19 +110,21 @@ fetch_archive_from_http(url=s3_url, output_dir=doc_dir)


```python
import os

from haystack.document_stores import ElasticsearchDocumentStore


# Make sure these indices do not collide with existing ones; they will be wiped clean before data is inserted
doc_index = "tutorial5_docs"
label_index = "tutorial5_labels"
```
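
If you want to check up front that these names are free, here is a hypothetical pre-flight check using the low-level `elasticsearch` client that Haystack depends on (the exact client API may vary with your installed version):

```python
from elasticsearch import Elasticsearch

# Hypothetical pre-flight check: warn if the indices already exist,
# since they will be wiped before the evaluation data is inserted
es = Elasticsearch(hosts=[os.environ.get("ELASTICSEARCH_HOST", "localhost")])
for index in (doc_index, label_index):
    if es.indices.exists(index=index):
        print(f"Warning: index '{index}' already exists and will be recreated")
```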


```python
# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")

# Connect to Elasticsearch
document_store = ElasticsearchDocumentStore(
host="localhost",
host=host,
username="",
password="",
index=doc_index,
@@ -490,7 +511,7 @@ EXPERIMENT_NAME = "haystack-tutorial-5"


```python
document_store = ElasticsearchDocumentStore(host=host, index="sparse_index", recreate_index=True)
preprocessor = PreProcessor(
split_length=200,
split_overlap=0,
@@ -524,7 +545,7 @@ sparse_eval_result = Pipeline.execute_eval_run(


```python
document_store = ElasticsearchDocumentStore(host=host, index="dense_index", recreate_index=True)
emb_retriever = EmbeddingRetriever(
document_store=document_store,
model_format="sentence_transformers",
