
Replace FARM import statements; add dependencies #1492

Merged: 29 commits merged from farm_merging_dependencies into master on Sep 28, 2021

Commits (29)
158197a
Replace FARM import statements; add dependencies
julian-risch Sep 22, 2021
59cb039
Add latest docstring and tutorial changes
github-actions[bot] Sep 22, 2021
4d48086
Add InferenceProc., TextCl.Proc., TextPairCl.Proc.
julian-risch Sep 23, 2021
566e5fa
Merge branch 'farm_merging_dependencies' of github.com:deepset-ai/hay…
julian-risch Sep 23, 2021
1dab474
Add latest docstring and tutorial changes
github-actions[bot] Sep 23, 2021
bfbbd38
Remove FARMRanker, add type annotations, rename max_sample
julian-risch Sep 23, 2021
4d31195
Add sample_to_features_text for InferenceProc.
julian-risch Sep 23, 2021
d577149
Merge branch 'farm_merging_dependencies' of github.com:deepset-ai/hay…
julian-risch Sep 23, 2021
908f674
Add latest docstring and tutorial changes
github-actions[bot] Sep 23, 2021
5b77873
Fix type annotations: model_name_or_path is str not Path
julian-risch Sep 23, 2021
29c5a61
Merge branch 'farm_merging_dependencies' of github.com:deepset-ai/hay…
julian-risch Sep 23, 2021
a7760bb
Add latest docstring and tutorial changes
github-actions[bot] Sep 23, 2021
23f26f7
Fix mypy errors: implement _create_dataset in TextCl.Proc.
julian-risch Sep 23, 2021
114e5da
Merge branch 'farm_merging_dependencies' of github.com:deepset-ai/hay…
julian-risch Sep 23, 2021
35a3cd8
Correct formatting of comments
julian-risch Sep 23, 2021
0d2aa5f
Remove empty line to prevent line.strip()[0] == "#" IndexError
julian-risch Sep 23, 2021
4cea167
Add task_type "embeddings" in Inferencer
julian-risch Sep 23, 2021
12460b0
Allow loading AdaptiveModel for embedding task
julian-risch Sep 23, 2021
15583f3
Add SQuAD eval metrics; enable InferenceProc for embedding task
julian-risch Sep 23, 2021
23e2d29
Add baskets as param to log_samples
julian-risch Sep 23, 2021
036d655
Handle empty basket list in log_samples
julian-risch Sep 23, 2021
f0eb6ea
Remove unused dependencies
julian-risch Sep 23, 2021
5375ce9
Remove FARMClassifier (doc classificer) due to ref to TextClassificat…
julian-risch Sep 23, 2021
9def673
Merge branch 'master' into farm_merging_dependencies
julian-risch Sep 23, 2021
5590cba
Remove FARMRanker and Classifier from doc generation scripts
julian-risch Sep 23, 2021
7278005
Merge branch 'farm_merging_dependencies' of github.com:deepset-ai/hay…
julian-risch Sep 23, 2021
1711d95
Add latest docstring and tutorial changes
github-actions[bot] Sep 23, 2021
5306b14
Merge branch 'master' into farm_merging_dependencies: Test Refactoring
julian-risch Sep 27, 2021
251e72f
Fix import statements and type annotations
julian-risch Sep 28, 2021
199 changes: 0 additions & 199 deletions docs/_src/api/api/classifier.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/_src/api/api/generate_docstrings.sh
@@ -17,5 +17,4 @@
pydoc-markdown pydoc-markdown-graph-retriever.yml
pydoc-markdown pydoc-markdown-evaluation.yml
pydoc-markdown pydoc-markdown-ranker.yml
pydoc-markdown pydoc-markdown-question-generator.yml
-pydoc-markdown pydoc-markdown-classifier.yml

18 changes: 0 additions & 18 deletions docs/_src/api/api/pydoc-markdown-classifier.yml

This file was deleted.

2 changes: 1 addition & 1 deletion docs/_src/api/api/pydoc-markdown-ranker.yml
@@ -1,7 +1,7 @@
loaders:
- type: python
  search_path: [../../../../haystack/ranker]
-  modules: ['base', 'farm']
+  modules: ['base', 'sentence_transformers']
  ignore_when_discovered: ['__init__']
processor:
- type: filter
117 changes: 19 additions & 98 deletions docs/_src/api/api/ranker.md
@@ -51,130 +51,51 @@ position in the ranking of documents the correct document is.
- `return_preds`: Whether to add predictions in the returned dictionary. If True, the returned dictionary
contains the keys "predictions" and "metrics".

<a name="farm"></a>
# Module farm
<a name="sentence_transformers"></a>
# Module sentence\_transformers

<a name="farm.FARMRanker"></a>
## FARMRanker Objects
<a name="sentence_transformers.SentenceTransformersRanker"></a>
## SentenceTransformersRanker Objects

```python
-class FARMRanker(BaseRanker)
+class SentenceTransformersRanker(BaseRanker)
```

-Transformer based model for Document Re-ranking using the TextPairClassifier of FARM framework (https://github.com/deepset-ai/FARM).
+Sentence Transformer based pre-trained Cross-Encoder model for Document Re-ranking (https://huggingface.co/cross-encoder).
Re-Ranking can be used on top of a retriever to boost the performance for document search. This is particularly useful if the retriever has a high recall but is bad in sorting the documents by relevance.
While the underlying model can vary (BERT, Roberta, DistilBERT, ...), the interface remains the same.
-FARMRanker handles Cross-Encoder models that internally use two logits and output the classifier's probability of label "1" as similarity score.
-This includes TextPairClassification models trained within FARM.
-In contrast, SentenceTransformersRanker handles Cross-Encoder models that use a single logit as similarity score.

+SentenceTransformerRanker handles Cross-Encoder models that use a single logit as similarity score.
+https://www.sbert.net/docs/pretrained-models/ce-msmarco.html#usage-with-transformers
+In contrast, FARMRanker handles Cross-Encoder models that internally use two logits and output the classifier's probability of label "1" as similarity score.
+This includes TextPairClassification models trained within FARM.

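The single-logit versus two-logit distinction drawn in the docstring above is easiest to see in code. Below is a minimal sketch (not part of the diff) that scores one query-passage pair with a single-logit Cross-Encoder via the `transformers` library, following the sbert.net usage notes linked above; the query and passage strings are made-up examples.

```python
# Minimal sketch: single-logit Cross-Encoder scoring, as the new docstring
# describes. Assumes the `transformers` and `torch` packages; the query and
# passage are illustrative examples only.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "cross-encoder/ms-marco-MiniLM-L-12-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

query = "How many people live in Berlin?"
passage = "Berlin has a population of around 3.7 million registered inhabitants."

inputs = tokenizer(query, passage, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits   # shape (1, 1): one logit per pair
score = logits.squeeze().item()       # higher score = more relevant

# A two-logit model (e.g. a FARM TextPairClassification head) would instead
# return logits of shape (1, 2), and the similarity score would be
# torch.softmax(logits, dim=1)[0, 1], the probability of label "1".
```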
-| With a FARMRanker, you can:
+| With a SentenceTransformersRanker, you can:
- directly get predictions via predict()
- fine-tune the model on TextPair data via train()

Usage example:
...
retriever = ElasticsearchRetriever(document_store=document_store)
-ranker = FARMRanker(model_name_or_path="deepset/gbert-base-germandpr-reranking")
+ranker = SentenceTransformersRanker(model_name_or_path="cross-encoder/ms-marco-MiniLM-L-12-v2")
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=ranker, name="Ranker", inputs=["ESRetriever"])

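For completeness, a self-contained version of the usage example in the docstring above. The import paths and the result's field names are assumptions, as both moved between Haystack releases around the time of this PR; the pipeline wiring itself mirrors the docstring.

```python
# Hedged, self-contained version of the docstring's usage example.
# Import paths are assumptions (they changed across Haystack releases).
from haystack.document_store import ElasticsearchDocumentStore  # assumed path
from haystack.retriever import ElasticsearchRetriever           # assumed path
from haystack.ranker import SentenceTransformersRanker          # assumed path
from haystack.pipeline import Pipeline                          # assumed path

document_store = ElasticsearchDocumentStore(host="localhost", index="document")
retriever = ElasticsearchRetriever(document_store=document_store)
ranker = SentenceTransformersRanker(
    model_name_or_path="cross-encoder/ms-marco-MiniLM-L-12-v2"
)

p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=ranker, name="Ranker", inputs=["ESRetriever"])

result = p.run(query="How many people live in Berlin?")
for doc in result["documents"]:
    # Document field names varied by version (.text vs .content).
    print(doc.score, doc.text[:80])
```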
<a name="farm.FARMRanker.__init__"></a>
<a name="sentence_transformers.SentenceTransformersRanker.__init__"></a>
#### \_\_init\_\_

```python
-| __init__(model_name_or_path: Union[str, Path], model_version: Optional[str] = None, batch_size: int = 50, use_gpu: bool = True, top_k: int = 10, num_processes: Optional[int] = None, max_seq_len: int = 256, progress_bar: bool = True)
+| __init__(model_name_or_path: Union[str, Path], model_version: Optional[str] = None, top_k: int = 10)
```

**Arguments**:

-- `model_name_or_path`: Directory of a saved model or the name of a public model e.g. 'bert-base-cased',
-'deepset/bert-base-cased-squad2', 'deepset/bert-base-cased-squad2', 'distilbert-base-uncased-distilled-squad'.
-See https://huggingface.co/models for full list of available models.
+- `model_name_or_path`: Directory of a saved model or the name of a public model e.g.
+'cross-encoder/ms-marco-MiniLM-L-12-v2'.
+See https://huggingface.co/cross-encoder for full list of available models
- `model_version`: The version of model to use from the HuggingFace model hub. Can be tag name, branch name, or commit hash.
-- `batch_size`: Number of samples the model receives in one batch for inference.
-Memory consumption is much lower in inference mode. Recommendation: Increase the batch size
-to a value so only a single batch is used.
-- `use_gpu`: Whether to use GPU (if available)
- `top_k`: The maximum number of documents to return
-- `num_processes`: The number of processes for `multiprocessing.Pool`. Set to value of 0 to disable
-multiprocessing. Set to None to let Inferencer determine optimum number. If you
-want to debug the Language Model, you might need to disable multiprocessing!
-- `max_seq_len`: Max sequence length of one input text for the model
-- `progress_bar`: Whether to show a tqdm progress bar or not.
-Can be helpful to disable in production deployments to keep the logs clean.

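Spelled out as a call, the trimmed-down constructor now takes only the three arguments kept above; a hedged sketch, with illustrative values:

```python
# Constructing the ranker per the remaining arguments; model_version may be
# a tag, branch, or commit hash on the HF hub (None uses the default).
ranker = SentenceTransformersRanker(
    model_name_or_path="cross-encoder/ms-marco-MiniLM-L-12-v2",
    model_version=None,  # e.g. "main", or a specific commit hash
    top_k=10,            # maximum number of documents to return
)
```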
<a name="farm.FARMRanker.train"></a>
#### train

```python
| train(data_dir: str, train_filename: str, dev_filename: Optional[str] = None, test_filename: Optional[str] = None, use_gpu: Optional[bool] = None, batch_size: int = 10, n_epochs: int = 2, learning_rate: float = 1e-5, max_seq_len: Optional[int] = None, warmup_proportion: float = 0.2, dev_split: float = 0, evaluate_every: int = 300, save_dir: Optional[str] = None, num_processes: Optional[int] = None, use_amp: str = None)
```

Fine-tune a model on a TextPairClassification dataset. Options:

- Take a plain language model (e.g. `bert-base-cased`) and train it for TextPairClassification
- Take a TextPairClassification model and fine-tune it for your domain

**Arguments**:

- `data_dir`: Path to directory containing your training data
- `train_filename`: Filename of training data
- `dev_filename`: Filename of dev / eval data
- `test_filename`: Filename of test data
- `dev_split`: Instead of specifying a dev_filename, you can also specify a ratio (e.g. 0.1) here
that gets split off from training data for eval.
- `use_gpu`: Whether to use GPU (if available)
- `batch_size`: Number of samples the model receives in one batch for training
- `n_epochs`: Number of iterations on the whole training data set
- `learning_rate`: Learning rate of the optimizer
- `max_seq_len`: Maximum text length (in tokens). Everything longer gets cut down.
- `warmup_proportion`: Proportion of training steps until maximum learning rate is reached.
Until that point LR is increasing linearly. After that it's decreasing again linearly.
Options for different schedules are available in FARM.
- `evaluate_every`: Evaluate the model every X steps on the hold-out eval dataset
- `save_dir`: Path to store the final model
- `num_processes`: The number of processes for `multiprocessing.Pool` during preprocessing.
Set to value of 1 to disable multiprocessing. When set to 1, you cannot split away a dev set from train set.
Set to None to use all CPU cores minus one.
- `use_amp`: Optimization level of NVIDIA's automatic mixed precision (AMP). The higher the level, the faster the model.
Available options:
None (Don't use AMP)
"O0" (Normal FP32 training)
"O1" (Mixed Precision => Recommended)
"O2" (Almost FP16)
"O3" (Pure FP16).
See details on: https://nvidia.github.io/apex/amp.html

**Returns**:

None

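Since the entire train() section disappears along with FARMRanker, here is a hedged sketch of how the removed API, as documented above, was invoked; `data_dir` and the filenames are hypothetical placeholders.

```python
# Sketch of the removed FARMRanker.train() call, using only parameters from
# the deleted documentation above; paths and filenames are hypothetical.
ranker = FARMRanker(model_name_or_path="bert-base-cased")
ranker.train(
    data_dir="data/text_pair",    # hypothetical training-data directory
    train_filename="train.tsv",   # hypothetical TextPairClassification file
    dev_split=0.1,                # split 10% off the training data for eval
    batch_size=10,
    n_epochs=2,
    learning_rate=1e-5,
    warmup_proportion=0.2,        # linear LR warm-up, then linear decay
    evaluate_every=300,           # eval on the hold-out set every 300 steps
    save_dir="saved_models/farm_ranker",  # hypothetical output directory
)
```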
<a name="farm.FARMRanker.update_parameters"></a>
#### update\_parameters

```python
| update_parameters(max_seq_len: Optional[int] = None)
```

Hot update parameters of a loaded Ranker. It may not to be safe when processing concurrent requests.

<a name="farm.FARMRanker.save"></a>
#### save

```python
| save(directory: Path)
```

Saves the Ranker model so that it can be reused at a later point in time.

**Arguments**:

- `directory`: Directory where the Ranker model should be saved

<a name="farm.FARMRanker.predict_batch"></a>
<a name="sentence_transformers.SentenceTransformersRanker.predict_batch"></a>
#### predict\_batch

```python
@@ -195,7 +116,7 @@ Returns list of dictionary of query and list of document sorted by (desc.) simil

List of dictionaries containing query and ranked list of Document

<a name="farm.FARMRanker.predict"></a>
<a name="sentence_transformers.SentenceTransformersRanker.predict"></a>
#### predict

```python