Refactor Tutorial 04 to make it testable (deepset-ai#8)

* refactor Tutorial 04 to make it testable
* fix ES docstore creation
* update markdown version
* wait for ES to be ready
blancadesal · Sep 16, 2022 · 356ef2d
1 parent f84091f
commit 356ef2d

Showing 4 changed files with 142 additions and 95 deletions.
diff --git a/.github/workflows/nightly.yml b/.github/workflows/nightly.yml
@@ -21,9 +21,10 @@ jobs:
  max-parallel: 2
  matrix:
  notebook:
- # Note: use only the name of the file without the extension
+ # Note: use the name of the file without the extension
  - 01_Basic_QA_Pipeline
  - 03_Basic_QA_Pipeline_without_Elasticsearch
+ - 04_FAQ_style_QA
 
  env:
  ELASTICSEARCH_HOST: "elasticsearch"
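
Note on the readiness wait: the refactored tutorial reads `ELASTICSEARCH_HOST` (defaulting to `localhost`) and then sleeps a fixed 30 seconds before creating the document store. As a possible alternative, here is a minimal sketch that polls the Elasticsearch HTTP endpoint until it answers. The helper name `wait_for_elasticsearch` and its parameters are hypothetical and not part of this commit or of Haystack. It assumes Elasticsearch listens on its default HTTP port 9200 without authentication.

```python
import time
import urllib.request


def wait_for_elasticsearch(host: str, port: int = 9200,
                           timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll http://host:port until it returns HTTP 200 or `timeout` seconds pass.

    Hypothetical helper, shown only as an illustration.
    Returns True if the server answered in time, False otherwise.
    """
    url = f"http://{host}:{port}"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or an HTTP error status:
            # the server is not ready yet, so retry after a short pause.
            pass
        time.sleep(interval)
    return False
```

In the tutorial this could stand in for `time.sleep(30)`, for example `wait_for_elasticsearch(os.environ.get("ELASTICSEARCH_HOST", "localhost"))`. The fixed sleep in the commit is simpler, while polling returns as soon as the server responds and reports a failure explicitly.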

diff --git a/.github/workflows/run_tutorials.yml b/.github/workflows/run_tutorials.yml
@@ -49,5 +49,7 @@ jobs:
  done
 
  - name: Run the converted notebooks
+ # Note: the `+` at the end of the `find` invocation will make it fail if any
+ # of the execs failed, otherwise `find` returns 0 even when the execs fail.
  run: |
- find ./tutorials -name "*.py" -execdir python {} \;
+ find ./tutorials -name "*.py" -execdir python {} +;
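
Note on the exit status: the comment added in this hunk can be checked locally. The sketch below is an illustration, not part of the commit. It creates one failing script in a temporary directory and compares the exit status of `find` for the `;` and `+` terminators. It uses `-exec` rather than `-execdir` to keep the demo independent of `PATH` restrictions (GNU find documents the same exit-status rule for both), and it assumes a POSIX-compatible `find` is on the `PATH`. The file name `failing.py` is hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def find_exit_code(root: str, terminator: str) -> int:
    """Run `find root -name "*.py" -exec <python> {} <terminator>` and return find's exit status."""
    cmd = ["find", root, "-name", "*.py", "-exec", sys.executable, "{}", terminator]
    return subprocess.run(cmd, capture_output=True).returncode


with tempfile.TemporaryDirectory() as root:
    # Hypothetical stand-in for a converted notebook that fails.
    Path(root, "failing.py").write_text("raise SystemExit(1)\n")

    per_file = find_exit_code(root, ";")  # failure of the command is not propagated
    batched = find_exit_code(root, "+")   # failure of the command makes find fail

print(f"';' form exit status: {per_file}")
print(f"'+' form exit status: {batched}")
```

One thing to keep in mind with the `+` form is that `find` may pass several file names to a single invocation of the command, so it is worth verifying that every converted script is actually executed.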
diff --git a/markdowns/4.md b/markdowns/4.md
@@ -33,20 +33,23 @@ Make sure you enable the GPU runtime to experience decent speed in this tutorial
 
 <img src="https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/colab_gpu_runtime.jpg">
 
+You can double check whether the GPU runtime is enabled with the following command:
 
-```python
-# Make sure you have a GPU running
-!nvidia-smi
+
+```bash
+%%bash
+
+nvidia-smi
 ```
 
+To start, install the latest release of Haystack with `pip`:
 
-```python
-# Install the latest release of Haystack in your own environment
-#! pip install farm-haystack
 
-# Install the latest main of Haystack
-!pip install --upgrade pip
-!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]
+```bash
+%%bash
+
+pip install --upgrade pip
+pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]
 ```
 
 ## Logging
@@ -64,14 +67,6 @@ logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logg
 logging.getLogger("haystack").setLevel(logging.INFO)
 ```
 
-
-```python
-from haystack.document_stores import ElasticsearchDocumentStore
-
-from haystack.nodes import EmbeddingRetriever
-import pandas as pd
-```
-
 ### Start an Elasticsearch server
 You can start Elasticsearch on your local machine instance using Docker. If Docker is not readily available in your environment (eg., in Colab notebooks), then you can manually download and execute Elasticsearch from source.
 
@@ -83,36 +78,49 @@ from haystack.utils import launch_es
 launch_es()
 ```
 
+### Start an Elasticsearch server in Colab
 
-```python
-# In Colab / No Docker environments: Start Elasticsearch from source
-! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
-! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
-! chown -R daemon:daemon elasticsearch-7.9.2
+If Docker is not readily available in your environment (e.g. in Colab notebooks), then you can manually download and execute Elasticsearch from source.
 
-import os
-from subprocess import Popen, PIPE, STDOUT
 
-es_server = Popen(
- ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1) # as daemon
-)
-# wait until ES has started
-! sleep 30
+```bash
+%%bash
+
+wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
+tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
+chown -R daemon:daemon elasticsearch-7.9.2
+sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch -d
+```
+
+
+```bash
+%%bash --bg
+
+sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch
 ```
 
 ### Init the DocumentStore
-In contrast to Tutorial 1 (extractive QA), we:
+In contrast to Tutorial 1 (Build your first QA system), we:
 
 * specify the name of our `text_field` in Elasticsearch that we want to return as an answer
 * specify the name of our `embedding_field` in Elasticsearch where we'll store the embedding of our question and that is used later for calculating our similarity to the incoming user question
 * set `excluded_meta_data=["question_emb"]` so that we don't return the huge embedding vectors in our search results
 
 
 ```python
+import os
+import time
+
 from haystack.document_stores import ElasticsearchDocumentStore
 
+# Wait 30 seconds only to be sure Elasticsearch is ready before continuing
+time.sleep(30)
+
+# Get the host where Elasticsearch is running, default to localhost
+host = os.environ.get("ELASTICSEARCH_HOST", "localhost")
+
 document_store = ElasticsearchDocumentStore(
- host="localhost",
+ host=host,
  username="",
  password="",
  index="document",
@@ -129,6 +137,8 @@ We can use the `EmbeddingRetriever` for this purpose and specify a model that we
 
 
 ```python
+from haystack.nodes import EmbeddingRetriever
+
 retriever = EmbeddingRetriever(
  document_store=document_store,
  embedding_model="sentence-transformers/all-MiniLM-L6-v2",
@@ -143,8 +153,11 @@ Here: We download some question-answer pairs related to COVID-19
 
 
 ```python
+import pandas as pd
+
 from haystack.utils import fetch_archive_from_http
 
+
 # Download
 doc_dir = "data/tutorial4"
 s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip"