Commit

Merge pull request deepset-ai#164 from deepset-ai/v0.10

Docs V0.10

brandenchan committed Sep 23, 2021
2 parents babe575 + e4f8323 commit 1f87136
Showing 9 changed files with 328 additions and 45 deletions.
43 changes: 43 additions & 0 deletions docs/latest/components/classifier.mdx
@@ -0,0 +1,43 @@
# Classifier

The Classifier Node is a transformer-based classification model used to create predictions that can be attached to retrieved documents as metadata.
For example, by using a sentiment model, you can label each document as being either positive or negative in sentiment.
Thanks to tight integration with the Hugging Face Model Hub, you can load any classification model simply by supplying its name.

![image](/img/classifier.png)

<div className="max-w-xl bg-yellow-light-theme border-l-8 border-yellow-dark-theme px-6 pt-6 pb-4 my-4 rounded-md dark:bg-yellow-900">

Note that the Classifier is different from the Query Classifier.
While the Query Classifier categorizes incoming queries in order to route them to different parts of the pipeline,
the Classifier is used to create classification labels that can be attached to retrieved documents as metadata.

</div>

## Usage

Initialize it as follows:

``` python
from haystack.classifier import FARMClassifier

classifier_model = 'textattack/bert-base-uncased-imdb'
classifier = FARMClassifier(model_name_or_path=classifier_model)
```

It is slotted into a pipeline as follows:

``` python
from haystack.pipeline import Pipeline

pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=classifier, name="Classifier", inputs=["Retriever"])
```
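Once the pipeline runs, the classifier's prediction travels with each retrieved document as metadata. As a hedged illustration (the metadata key `classification` used below is an assumption for Haystack v0.10; check the key for your version), you can group the returned documents by predicted label:

``` python
# Sketch (not from the original docs): group retrieved documents by their
# predicted label. Assumes the Classifier writes its prediction under
# doc.meta["classification"]["label"]; adjust the key if your version differs.

def group_by_label(documents):
    """Group documents by the classification label stored in their metadata."""
    groups = {}
    for doc in documents:
        label = doc.meta.get("classification", {}).get("label", "unknown")
        groups.setdefault(label, []).append(doc)
    return groups

# result = pipeline.run(query="What did the reviewers think?", top_k_retriever=10)
# by_label = group_by_label(result["documents"])
```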

It can also be run in isolation:

``` python
documents = classifier.predict(
    query="",
    documents=[doc1, doc2, doc3, ...]
)
```
66 changes: 62 additions & 4 deletions docs/latest/components/document_store.mdx
@@ -29,12 +29,48 @@ docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.9.2
Next you can initialize the Haystack object that will connect to this instance.

```python
from haystack.document_store import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore()
```
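Before querying, the store needs documents. A minimal sketch, assuming the v0.10 dictionary format with a `text` field and a local Elasticsearch instance running as above:

``` python
# Sketch: write a few documents to the store. In Haystack v0.10, documents
# are passed to write_documents() as dictionaries with a "text" field.

def to_haystack_dicts(texts):
    """Wrap raw strings in the dictionary format write_documents() expects."""
    return [{"text": text, "meta": {}} for text in texts]

docs = to_haystack_dicts([
    "Haystack is a framework for building search systems.",
    "Elasticsearch stores and indexes the documents.",
])
# document_store.write_documents(docs)
# document_store.get_document_count()  # returns the number of indexed documents
```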

### Open Distro for Elasticsearch

Learn how to get started [here](https://opendistro.github.io/for-elasticsearch-docs/#get-started).

If you have Docker set up, we recommend pulling the Docker image and running it.

```bash
docker pull amazon/opendistro-for-elasticsearch:1.13.2
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" amazon/opendistro-for-elasticsearch:1.13.2
```

Next you can initialize the Haystack object that will connect to this instance.

```python
from haystack.document_store import OpenDistroElasticsearchDocumentStore

document_store = OpenDistroElasticsearchDocumentStore()
```

### OpenSearch

Learn how to get started [here](https://opensearch.org/docs/#docker-quickstart).

If you have Docker set up, we recommend pulling the Docker image and running it.

```bash
docker pull opensearchproject/opensearch:1.0.1
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.0.1
```

Next you can initialize the Haystack object that will connect to this instance.

```python
from haystack.document_store import OpenSearchDocumentStore

document_store = OpenSearchDocumentStore()
```

<div style={{ marginBottom: "3rem" }} />

@@ -210,12 +246,34 @@ The Document Stores have different characteristics. You should choose one depend
- Fast & accurate sparse retrieval with many tuning options
- Basic support for dense retrieval
- Production-ready
- Also supports Open Distro

**Cons:**

- Slow for dense retrieval with more than ~1 million documents

### Open Distro for Elasticsearch

**Pros:**

- Fully open source (Apache 2.0 license)
- Essentially the same features as Elasticsearch

**Cons:**

- Slow for dense retrieval with more than ~1 million documents

### OpenSearch

**Pros:**

- Fully open source (Apache 2.0 license)
- Essentially the same features as Elasticsearch
- Stronger support for vector similarity comparisons and approximate nearest neighbour algorithms

**Cons:**

- Not as optimized as dedicated vector similarity options like Milvus and FAISS

<div style={{ marginBottom: "3rem" }} />

### Milvus
44 changes: 39 additions & 5 deletions docs/latest/components/generator.mdx
@@ -2,27 +2,61 @@

While extractive QA highlights the span of text that answers a query,
generative QA can return a novel text answer that it has composed.

The best current approaches, such as [Retrieval-Augmented Generation](https://arxiv.org/abs/2005.11401) and [LFQA](https://yjernite.github.io/lfqa.html),
draw on both the knowledge gained during language model pretraining (parametric memory)
and the passages supplied to them by a retriever (non-parametric memory).

With the advent of Transformer-based retrieval methods such as [Dense Passage Retrieval](https://arxiv.org/abs/2004.04906),
the retriever and generator can be trained concurrently from a single loss signal.

<div className="max-w-xl bg-yellow-light-theme border-l-8 border-yellow-dark-theme px-6 pt-6 pb-4 my-4 rounded-md dark:bg-yellow-900">

**Tutorial:** Check out our tutorial notebooks for a guide to building your own generative QA system with RAG ([here](/tutorials/retrieval-augmented-generation))
or with LFQA ([here](/tutorials/pipelines)).

</div>

**Pros**

- More appropriately phrased answers
- Able to synthesize information from different texts
- Can draw on latent knowledge stored in language model

**Cons**

- It is not easy to track which piece of information the generator is basing its response on

## Usage

Initialize a Generator as follows:

``` python
from haystack.generator.transformers import RAGenerator

generator = RAGenerator(
    model_name_or_path="facebook/rag-sequence-nq",
    retriever=dpr_retriever,
    top_k=1,
    min_length=2
)
```

Running a Generator in a pipeline:

``` python
from haystack.pipeline import GenerativeQAPipeline

pipeline = GenerativeQAPipeline(generator=generator, retriever=dpr_retriever)
result = pipeline.run(query='What are the best party games for adults?', top_k_retriever=20)
```

Running a stand-alone Generator:

``` python
result = generator.predict(
    query='What are the best party games for adults?',
    documents=[doc1, doc2, doc3, ...],
    top_k=top_k
)
```
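In both cases, `result` is a dict with an `answers` list. A small helper, assuming that v0.10 result format (a list of dicts that each carry an `answer` key), extracts just the generated strings:

``` python
# Sketch (not part of the original docs): pull plain answer strings out of a
# Generator result. The exact result keys are an assumption for v0.10.

def extract_answers(result):
    """Return the generated answer strings from a Generator result."""
    return [answer["answer"] for answer in result.get("answers", [])]

sample = {"answers": [{"answer": "Charades", "score": 0.71}]}
print(extract_answers(sample))  # prints ['Charades']
```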
33 changes: 32 additions & 1 deletion docs/latest/components/preprocessing.mdx
@@ -50,6 +50,17 @@ Please refer to [the API docs](/reference/file-converters) to see which converte
valid_languages=["de","en"])
</code>
<code>doc = converter.convert(file_path=file, meta=None)</code>
<code>
# Alternatively, for PDFs that contain images, Haystack uses Tesseract under the hood to OCR them.
</code>
<code>
from haystack.file_converter import PDFToTextOCRConverter
</code>
<code>
converter = PDFToTextOCRConverter(remove_numeric_tables=False,
valid_languages=["deu","eng"])
</code>
<code>doc = converter.convert(file_path=file, meta=None)</code>
</pre>
),
},
@@ -71,7 +82,7 @@ Please refer to [the API docs](/reference/file-converters) to see which converte
content: (
<div>
<p>
Haystack also has a `convert_files_to_dicts()` utility function that
will convert all txt or pdf files in a given folder into this
dictionary format.
</p>
@@ -84,6 +95,26 @@ Please refer to [the API docs](/reference/file-converters) to see which converte
</div>
),
},
{
title: "Image",
content: (
<div>
<p>
Haystack supports extraction of text from images using OCR.
</p>
<pre>
<code>
from haystack.file_converter import ImageToTextConverter
</code>
<code>
converter = ImageToTextConverter(remove_numeric_tables=True,
valid_languages=["de","en"])
</code>
<code>doc = converter.convert(file_path=file, meta=None)</code>
</pre>
</div>
),
},
]}
/>

33 changes: 31 additions & 2 deletions docs/latest/components/ready_made_pipelines.mdx
@@ -43,7 +43,7 @@ We typically pass the output of the Retriever to another component such as the R

`DocumentSearchPipeline` wraps the [Retriever](/components/retriever) into a pipeline. Note that this wrapper does not endow the Retrievers with additional functionality but instead allows them to be used consistently with other Haystack Pipeline objects and with the same familiar syntax. Creating this pipeline is as simple as passing the Retriever into the pipeline’s constructor:

``` python
pipeline = DocumentSearchPipeline(retriever=retriever)

query = "Tell me something about that time when they play chess."
@@ -128,7 +128,7 @@ result = pipeline.run(query=query, params={"retriever": {"top_k": 10}, "reader":

You may access the answer and other information like the model’s confidence and original context via the `answers` key, in this manner:

``` python
result["answers"]
>>> [{'answer': 'der Klang der Musik',
'score': 9.269367218017578,
@@ -209,4 +209,33 @@ Output:
],
...
}
```

## MostSimilarDocumentsPipeline

This pipeline is used to find the most similar documents to a given document in your document store.

First make sure that your indexed documents have embeddings attached.
You can generate and store them using the `DocumentStore.update_embeddings()` method.

``` python
from haystack.pipeline import MostSimilarDocumentsPipeline

msd_pipeline = MostSimilarDocumentsPipeline(document_store)
result = msd_pipeline.run(document_ids=[doc_id1, doc_id2, ...])
print(result)
```

Output:

``` python
[[
{'text': "Southern California's economy is diver...",
'score': 0.8605178832348279,
'question': None,
'meta': {'name': 'Southern_California'},
'embedding': ...,
'id': '6e26b1b78c48efc6dd6c888e72d0970b'},
...
]]
```
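The nested list pairs one list of matches with each input document id. A short sketch, based only on the output format shown above, picks each document's closest match:

``` python
# Sketch (not from the original docs): for each query document, take the
# highest-ranked match. Assumes matches are returned sorted by score, as in
# the sample output above.

def top_match_ids(result):
    """Return the id of the best match for each query document."""
    return [matches[0]["id"] for matches in result if matches]

sample = [[{"id": "6e26b1b78c48efc6dd6c888e72d0970b", "score": 0.86},
           {"id": "another-doc-id", "score": 0.61}]]
print(top_match_ids(sample))  # prints ['6e26b1b78c48efc6dd6c888e72d0970b']
```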
