Commit afd670d: deploy current version
PiffPaffM committed Jun 11, 2021 (1 parent: 791e637)
Showing 9 changed files with 283 additions and 13 deletions.
@@ -116,6 +116,27 @@ from haystack.document_store import SQLDocumentStore
document_store = SQLDocumentStore()
```

</div>
</div>

<div class="tab">
<input type="radio" id="tab-1-6" name="tab-group-1">
<label class="labelouter" for="tab-1-6">Weaviate</label>
<div class="tabcontent">

The `WeaviateDocumentStore` requires a running Weaviate Server.
You can start a basic instance like this (see Weaviate docs for details):
```bash
docker run -d -p 8080:8080 --env AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED='true' --env PERSISTENCE_DATA_PATH='/var/lib/weaviate' semitechnologies/weaviate:1.4.0
```

Afterwards, you can use it in Haystack:
```python
from haystack.document_store import WeaviateDocumentStore

document_store = WeaviateDocumentStore()
```

</div>
</div>

@@ -264,6 +285,24 @@ The Document Stores have different characteristics. You should choose one depend
</div>
</div>


<div class="tab">
<input type="radio" id="tab-2-6" name="tab-group-2">
<label class="labelouter" for="tab-2-6">Weaviate</label>
<div class="tabcontent">

**Pros:**
- Simple vector search
- Stores everything in one place: documents, metadata and vectors - so less network overhead when scaling up
- Allows combining vector search with scalar filtering, i.e. you can filter for a certain tag and do dense retrieval on that subset

**Cons:**
- Fewer options for ANN algorithms than FAISS or Milvus
- No BM25 / TF-IDF retrieval

</div>
</div>

</div>

<div class="recommendation">
@@ -276,4 +315,4 @@ The Document Stores have different characteristics. You should choose one depend

**Vector Specialist:** Use the `MilvusDocumentStore` if you want to focus on dense retrieval and possibly deal with larger datasets

</div>
90 changes: 90 additions & 0 deletions src/pages/docs/versions/master/latest/site/en/usage/usage/faq.md
@@ -0,0 +1,90 @@
---
title: "Frequently Asked Questions"
metaTitle: "Frequently Asked Questions"
metaDescription: ""
slug: "/docs/faq"
date: "2020-09-03"
id: "faqmd"
---

# Frequently Asked Questions

## Why am I seeing duplicate answers being returned?

The ElasticsearchDocumentStore and MilvusDocumentStore rely on Elasticsearch and Milvus backend services which
persist after your Python script has finished running.
If you rerun your script without deleting documents, you could end up with duplicate
copies of your documents in your database.
The easiest way to avoid this is to call `DocumentStore.delete_documents()` after initialization
to ensure that you are working with an empty DocumentStore.

DocumentStores also have a `duplicate_documents` argument in their `__init__()` and `write_documents()` methods
where you can define whether you'd like to skip writing duplicates, overwrite existing duplicates, or raise an error when duplicates occur.
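
For instance, a minimal sketch assuming a locally running Elasticsearch instance (the `"skip"` option value is an assumption; check your version's API reference for the exact accepted values):
```python
from haystack.document_store import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore()

# Option 1: start from a clean slate so reruns don't accumulate duplicates
document_store.delete_documents()

docs = [{"text": "Arya Stark is a character in Game of Thrones.", "meta": {"name": "got.txt"}}]

# Option 2: keep existing documents but skip duplicates on write
document_store.write_documents(docs, duplicate_documents="skip")
```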

## How can I make sure that my GPU is being engaged when I use Haystack?

You will want to ensure that a CUDA-enabled GPU is being engaged when Haystack is running (you can check by running `nvidia-smi -l` on your command line).
Components that can be sped up by a GPU have a `use_gpu` argument in their constructor, which you will want to set to `True`.
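
For example, a minimal sketch (the model name is just an example):
```python
from haystack.reader import FARMReader

# use_gpu=True runs inference on the CUDA device if one is available
reader = FARMReader(model_name_or_path="deepset/minilm-uncased-squad2", use_gpu=True)
```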

## How do I speed up my predictions?

There are many different ways to speed up the performance of your Haystack system.

The Reader is usually the most computationally expensive component in a pipeline
and you can often speed up your system by using a smaller model, like `deepset/minilm-uncased-squad2` (see [benchmarks](https://huggingface.co/deepset/minilm-uncased-squad2)). This usually comes with a small trade-off in accuracy.

You can reduce the workload on the Reader by instructing the Retriever to pass on fewer documents.
This is done by setting the `top_k_retriever` parameter to a lower value.

Making sure that your documents are shorter can also increase the speed of your system. You can split
your documents into smaller chunks by using the `PreProcessor` (see [tutorial](https://haystack.deepset.ai/docs/latest/tutorial11md)).
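
Putting these together, a rough sketch (assuming an already initialized `reader` and `retriever`; import paths and split values may differ across versions):
```python
from haystack.preprocessor import PreProcessor
from haystack.pipeline import ExtractiveQAPipeline

# Split long documents into smaller chunks before indexing
processor = PreProcessor(split_by="word", split_length=200, split_overlap=0)
small_docs = processor.process(raw_doc)  # raw_doc: a dict with a "text" field

# At query time, pass fewer candidate documents from the Retriever to the Reader
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
prediction = pipeline.run(
    query="Who is the father of Arya Stark?",
    top_k_retriever=5,
    top_k_reader=3,
)
```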

For more optimization suggestions, have a look at our [optimization page](https://haystack.deepset.ai/docs/latest/optimizationmd)
and also our [blogs](https://medium.com/deepset-ai).

## How do I use Haystack for my language?

The components in Haystack, such as the `Retriever` or the `Reader`, are designed in a language-agnostic way. However, you may
have to set certain parameters or load models pretrained for your language in order to get good performance out of Haystack.
See our [languages page](https://haystack.deepset.ai/docs/latest/languagesmd) for more details.
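
For example, a sketch for German (the `analyzer` parameter and the model name are assumptions; see the languages page for what your version supports):
```python
from haystack.document_store import ElasticsearchDocumentStore
from haystack.reader import FARMReader

# Use a language-specific Elasticsearch analyzer for keyword-based retrieval
document_store = ElasticsearchDocumentStore(analyzer="german")

# Load a reader model pretrained on German QA data
reader = FARMReader(model_name_or_path="deepset/gelectra-base-germanquad")
```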

## How can I add metadata to my documents so that I can apply filters?

When providing your documents in the input format (see [here](https://haystack.deepset.ai/docs/latest/documentstoremd#Input-Format)),
you can provide metadata information as a dictionary under the `meta` key. At query time, you can provide a `filters` argument
(most likely through `Pipeline.run()`) that specifies the accepted values for a certain metadata field
(for an example of what a `filters` dictionary might look like, please refer to [this example](https://haystack.deepset.ai/docs/latest/apiretrievermd#__init__)).
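
A minimal sketch, assuming an initialized `document_store` and an extractive QA `pipeline` (field names and values are illustrative):
```python
docs = [
    {"text": "Berlin is the capital of Germany.", "meta": {"category": "geography"}},
    {"text": "The Bundestag is the federal parliament.", "meta": {"category": "politics"}},
]
document_store.write_documents(docs)

# Only documents whose "category" field matches an accepted value are retrieved
prediction = pipeline.run(
    query="What is the capital of Germany?",
    filters={"category": ["geography"]},
)
```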

## How can I see predictions during evaluation?

To see predictions during evaluation, initialize the `EvalDocuments` or `EvalAnswers` nodes with `debug=True`.
This causes their `EvalDocuments.log` or `EvalAnswers.log` attributes to be populated with a record of each prediction made.
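
A rough sketch (pipeline wiring is omitted; the import path reflects older Haystack versions and may differ in yours):
```python
from haystack.eval import EvalDocuments, EvalAnswers

# debug=True makes the nodes keep a record of every prediction they see
eval_retriever = EvalDocuments(debug=True)
eval_reader = EvalAnswers(debug=True)

# ... run your evaluation pipeline with these nodes plugged in ...

for record in eval_reader.log:  # one entry per prediction made during evaluation
    print(record)
```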

## How can I serve my Haystack model?

Haystack models can be wrapped in a REST API. For basic details on how to set this up, please refer to this section
on our [GitHub page](https://github.com/deepset-ai/haystack/blob/master/README.md#7-rest-api).
More comprehensive documentation is coming soon!
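
Purely as an illustrative sketch (the endpoint path, port, and payload shape below are assumptions; check the README section linked above for the actual API of your version):
```python
import requests

# Assumed endpoint and payload; verify against the REST API docs for your version
response = requests.post(
    "http://localhost:8000/query",
    json={"query": "Who is the father of Arya Stark?"},
)
print(response.json())
```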

## How can I interpret the confidence scores being returned by the Reader?

The confidence scores are in the range of 0 to 1 and reflect how confident the model is in each prediction that it makes.
Having a confidence score is particularly useful in cases where you need Haystack to work with a certain accuracy threshold.
Many of our users have built systems where predictions below a certain confidence value are routed on to a fallback system.
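
For example, a minimal sketch of such fallback routing, assuming an initialized extractive QA `pipeline` (the threshold value and the fallback handler are illustrative):
```python
THRESHOLD = 0.8  # tune against your accuracy requirements

prediction = pipeline.run(query="Who is the father of Arya Stark?")
best_answer = prediction["answers"][0]

if best_answer["confidence"] >= THRESHOLD:
    print(best_answer["answer"])
else:
    route_to_fallback(prediction)  # hypothetical fallback handler
```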

For more information on model confidence and how to tune it, please refer to [this section](https://haystack.deepset.ai/docs/latest/readermd#Confidence-Scores).

## My documents aren't showing up in my DocumentStore even though I've called `DocumentStore.write_documents()`

When indexing, retrieving or querying for documents from a DocumentStore, you can specify an `index` on which to perform this action.
This can be specified in almost all methods of `DocumentStore` as well as `Retriever.retrieve()`.
Ensure that you are performing all of these operations on the same index!
Note that this also applies during evaluation, where labels are written into their own separate DocumentStore index.
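
A minimal sketch, assuming an initialized `document_store` and `retriever` (index names are illustrative):
```python
# Write and retrieve against the same index, or the documents won't be found
document_store.write_documents(docs, index="document")
results = retriever.retrieve(query="Who is the father of Arya Stark?", index="document")
```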

## What is the difference between the FARMReader and the TransformersReader?

In short, the FARMReader uses a QA pipeline implementation that comes from our own
[FARM framework](https://github.com/deepset-ai/FARM), which we can more easily update and optimize for performance.
By contrast, the TransformersReader uses a QA pipeline implementation that comes from HuggingFace's [Transformers](https://github.com/huggingface/transformers).
See [this section](https://haystack.deepset.ai/docs/latest/readermd#Deeper-Dive-FARM-vs-Transformers)
for more details about their differences!
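
Both readers expose a similar interface, so switching between them is typically a one-line change. A minimal sketch (the model name is an example; parameter names may vary slightly between versions):
```python
from haystack.reader import FARMReader, TransformersReader

farm_reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
transformers_reader = TransformersReader(model_name_or_path="deepset/roberta-base-squad2")
```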
@@ -30,6 +30,7 @@ Alternatively, [this example](https://github.com/deepset-ai/FARM/blob/master/exa
### Description

The FARMRanker consists of a Transformer-based model for document re-ranking using the TextPairClassifier of [FARM](https://github.com/deepset-ai/FARM).
Given a text pair of query and passage, the TextPairClassifier predicts label "1" if the pair is similar or label "0" if they are dissimilar (accompanied by a probability).
While the underlying model can vary (BERT, Roberta, DistilBERT, ...), the interface remains the same.
With a FARMRanker, you can:
* Directly get predictions (a re-ranked version of the supplied list of Documents) via `predict()` if supplying a pre-trained model
@@ -247,7 +247,7 @@ When printing the full results of a Reader,
you will see that each prediction is accompanied
by a value in the range of 0 to 1 reflecting the model's confidence in that prediction.

- In the output of `print_answers()`, you will find the model confidence in a dictionary key called `probability`.
+ In the output of `print_answers()`, you will find the model confidence in a dictionary key called `confidence`.

```python
from haystack.utils import print_answers
@@ -263,17 +263,22 @@ print_answers(prediction, details="all")
'She travels with her father, Eddard, to '
"King's Landing when he is made Hand of the "
'King. Before she leaves,',
-            'probability': 0.9899835586547852,
+            'confidence': 0.9899835586547852,
...
},
]
}
```

In order to align this probability score with the model's accuracy, finetuning needs to be performed
-on a specific dataset. Have a look at this [FARM tutorial](https://github.com/deepset-ai/FARM/blob/master/examples/question_answering_confidence.py)
-to see how this is done.
-Note that a finetuned confidence score is specific to the domain that its finetuned on.
+on a specific dataset.
+To this end, the reader has a method `calibrate_confidence_scores(document_store, device, label_index, doc_index, label_origin)`.
+The parameters of this method are the same as for the `eval()` method because the calibration of confidence scores is performed on a dataset that comes with gold labels.
+The calibration calls the `eval()` method internally and therefore needs a DocumentStore containing labeled questions and evaluation documents.
+
+Have a look at this [FARM tutorial](https://github.com/deepset-ai/FARM/blob/master/examples/question_answering_confidence.py)
+to see how to compare calibrated confidence scores with uncalibrated confidence scores within FARM.
+Note that a finetuned confidence score is specific to the domain that it is finetuned on.
There is no guarantee that this performance can transfer to a new domain.
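
A minimal sketch of a calibration call, assuming a `reader` and a `document_store` that already contains evaluation documents and gold labels (index names and device are illustrative):
```python
reader.calibrate_confidence_scores(
    document_store=document_store,
    device="cuda",
    label_index="label",
    doc_index="eval_document",
)
```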

Having a confidence score is particularly useful in cases where you need Haystack to work with a certain accuracy threshold.