make tutorial 08 testable (deepset-ai#15)
* make tutorial 08 testable

* use unstable image
masci committed Sep 16, 2022
1 parent f6c123f commit 9d46ceb
Showing 4 changed files with 84 additions and 190 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/nightly.yml
@@ -8,7 +8,7 @@ on:
jobs:
run-tutorials:
runs-on: ubuntu-latest
container: deepset/haystack:base-massi-docker
container: deepset/haystack:base-cpu-main

services:
elasticsearch:
@@ -28,6 +28,7 @@ jobs:
- 05_Evaluation
- 06_Better_Retrieval_via_Embedding_Retrieval
- 07_RAG_Generator
- 08_Preprocessing
- 10_Knowledge_Graph
- 11_Pipelines
- 12_LFQA
@@ -52,4 +53,5 @@ jobs:
- name: Run the converted notebook
run: |
echo "/opt" >> $GITHUB_PATH
python ./tutorials/${{ matrix.notebook }}.py
3 changes: 2 additions & 1 deletion .github/workflows/run_tutorials.yml
@@ -9,7 +9,7 @@ on:
jobs:
run-tutorials:
runs-on: ubuntu-latest
container: deepset/haystack:base-massi-docker
container: deepset/haystack:base-cpu-main

services:
elasticsearch:
@@ -53,4 +53,5 @@ jobs:
# Note: the `+` at the end of the `find` invocation will make it fail if any
# of the execs failed, otherwise `find` returns 0 even when the execs fail.
run: |
echo "/opt" >> $GITHUB_PATH
find ./tutorials -name "*.py" -execdir python {} +;
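
The comment in this workflow step relies on how `find` reports failures. As a rough illustration (not part of the commit), the following Python sketch runs a deliberately failing command through `find -exec` with both terminators and compares the exit codes. It assumes a POSIX-style `find` on the `PATH` and uses `-exec` rather than `-execdir` to keep the demonstration independent of the working directory; the helper name `find_exit_codes` is hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Tuple


def find_exit_codes(directory: str) -> Tuple[int, int]:
    """Return find's exit codes for the `;` and `+` forms of -exec.

    Hypothetical helper for illustration; assumes `find` is installed.
    """
    # A command that always fails, standing in for a tutorial script that crashes.
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    base = ["find", directory, "-name", "*.py", "-exec", *failing, "{}"]
    # No shell is involved, so ";" and "+" reach find as literal arguments.
    with_semicolon = subprocess.run(base + [";"]).returncode
    with_plus = subprocess.run(base + ["+"]).returncode
    return with_semicolon, with_plus


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Create one empty file for find to match.
        (Path(tmp) / "example.py").write_text("")
        semicolon_code, plus_code = find_exit_codes(tmp)
        print("exit code with ';':", semicolon_code)
        print("exit code with '+':", plus_code)
```

With GNU findutils, the `;` form is expected to exit with 0 and the `+` form with a non-zero code. One caveat worth keeping in mind: the `+` form batches several file names into a single command line, so a command that treats only its first argument as a script receives the remaining names as plain arguments.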
38 changes: 19 additions & 19 deletions markdowns/8.md
@@ -28,20 +28,18 @@ docs = [
This tutorial will show you all the tools that Haystack provides to help you cast your data into this format.


```python
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack
```bash
%%bash

# Install the latest main of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,ocr]
pip install --upgrade pip
pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,ocr]

# For Colab/linux based machines
!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.04.tar.gz
# For Colab/linux based machines:
!wget https://dl.xpdfreader.com/xpdf-tools-linux-4.04.tar.gz
!tar -xvf xpdf-tools-linux-4.04.tar.gz && sudo cp xpdf-tools-linux-4.04/bin64/pdftotext /usr/local/bin

# For Macos machines
# !wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-mac-4.03.tar.gz
# For macOS machines:
# !wget https://dl.xpdfreader.com/xpdf-tools-mac-4.03.tar.gz
# !tar -xvf xpdf-tools-mac-4.03.tar.gz && sudo cp xpdf-tools-mac-4.03/bin64/pdftotext /usr/local/bin
```

@@ -62,15 +60,10 @@ logging.getLogger("haystack").setLevel(logging.INFO)


```python
# Here are the imports we need
from haystack.nodes import TextConverter, PDFToTextConverter, DocxToTextConverter, PreProcessor
from haystack.utils import convert_files_to_docs, fetch_archive_from_http
```
from haystack.utils import fetch_archive_from_http


```python
# This fetches some sample files to work with

doc_dir = "data/tutorial8"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial8.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
@@ -81,11 +74,12 @@ fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
Haystack's converter classes are designed to help you turn files on your computer into the documents
that can be processed by the Haystack pipeline.
There are file converters for txt, pdf, docx files as well as a converter that is powered by Apache Tika.
The parameter `valid_languages` does not convert files to the target language, but checks if the conversion worked as expected.
The parameter `valid_languages` does not convert files to the target language, but checks if the conversion worked as expected. Here are some examples of how you would use file converters:


```python
# Here are some examples of how you would use file converters
from haystack.nodes import TextConverter, PDFToTextConverter, DocxToTextConverter, PreProcessor


converter = TextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_txt = converter.convert(file_path="data/tutorial8/classics.txt", meta=None)[0]
@@ -97,9 +91,12 @@ converter = DocxToTextConverter(remove_numeric_tables=False, valid_languages=["e
doc_docx = converter.convert(file_path="data/tutorial8/heavy_metal.docx", meta=None)[0]
```

Haystack also has a convenience function that will automatically apply the right converter to each file in a directory:


```python
# Haystack also has a convenience function that will automatically apply the right converter to each file in a directory.
from haystack.utils import convert_files_to_docs


all_docs = convert_files_to_docs(dir_path=doc_dir)
```
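
To make the idea of applying the right converter to each file concrete, here is a minimal sketch of suffix-based dispatch. It is not Haystack's implementation of `convert_files_to_docs`: the `CONVERTERS` registry, `read_plain_text`, and `convert_directory` are hypothetical names, and the `content`/`meta` dictionary layout is an assumption about the target format.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def read_plain_text(path: Path) -> str:
    """Hypothetical stand-in for a real converter: read the file as UTF-8 text."""
    return path.read_text(encoding="utf-8")


# Hypothetical registry mapping a file suffix to the function that converts it.
CONVERTERS: Dict[str, Callable[[Path], str]] = {
    ".txt": read_plain_text,
}


def convert_directory(dir_path: str) -> List[dict]:
    """Convert every supported file under dir_path, skipping unknown suffixes."""
    docs = []
    for path in sorted(Path(dir_path).rglob("*")):
        converter = CONVERTERS.get(path.suffix.lower())
        if path.is_file() and converter is not None:
            # Assumed document layout: text under "content", file name under "meta".
            docs.append({"content": converter(path), "meta": {"name": path.name}})
    return docs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "notes.txt").write_text("hello", encoding="utf-8")
        (Path(tmp) / "image.bin").write_bytes(b"\x00\x01")
        print(convert_directory(tmp))
```

Supporting another file type in this sketch would mean adding one entry to the registry, which is the main appeal of dispatching on the suffix.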
@@ -115,6 +112,9 @@ and [Optimization](https://haystack.deepset.ai/docs/latest/optimizationmd) pages


```python
from haystack.nodes import PreProcessor


# This is a default usage of the PreProcessor.
# Here, it performs cleaning of consecutive whitespaces
# and splits a single large document into smaller documents.