Docs v0.7.0 (deepset-ai#757)
* new docs version

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
PiffPaffM and github-actions[bot] committed Jan 22, 2021
1 parent 5081542 commit aee90c5
Showing 89 changed files with 7,739 additions and 106 deletions.
196 changes: 98 additions & 98 deletions docs/_src/api/api/document_store.md

Large diffs are not rendered by default.

45 changes: 42 additions & 3 deletions docs/_src/api/api/preprocessor.md
@@ -79,11 +79,11 @@ the parameters passed into PreProcessor.__init__(). Takes a single document as i
<a name="utils"></a>
# Module utils

<a name="utils.eval_data_from_file"></a>
#### eval\_data\_from\_file
<a name="utils.eval_data_from_json"></a>
#### eval\_data\_from\_json

```python
-eval_data_from_file(filename: str, max_docs: Union[int, bool] = None) -> Tuple[List[Document], List[Label]]
+eval_data_from_json(filename: str, max_docs: Union[int, bool] = None, preprocessor: PreProcessor = None) -> Tuple[List[Document], List[Label]]
```

Read Documents + Labels from a SQuAD-style file.
@@ -98,6 +98,29 @@ Document and Labels can then be indexed to the DocumentStore and be used for eva

(List of Documents, List of Labels)
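
For illustration, here is a minimal usage sketch (the file path and `max_docs` value are assumptions):

```python
from haystack.preprocessor.utils import eval_data_from_json

# Hypothetical path to a SQuAD-style eval file; max_docs keeps the test run small
docs, labels = eval_data_from_json("data/squad20/dev-v2.0.json", max_docs=10)
print(f"Loaded {len(docs)} documents and {len(labels)} labels")
```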

<a name="utils.eval_data_from_jsonl"></a>
#### eval\_data\_from\_jsonl

```python
eval_data_from_jsonl(filename: str, batch_size: Optional[int] = None, max_docs: Union[int, bool] = None, preprocessor: PreProcessor = None) -> Generator[Tuple[List[Document], List[Label]], None, None]
```

Read Documents + Labels from a SQuAD-style file in jsonl format, i.e. one document per line.
Documents and Labels can then be indexed to the DocumentStore and be used for evaluation.

This is a generator that yields one tuple per iteration, containing a list
of batch_size documents and a list of the documents' labels.
If batch_size is set to None, this method yields all documents and labels in a single tuple.

**Arguments**:

- `filename`: Path to file in SQuAD format
- `batch_size`: Number of documents to yield per iteration. If set to None, all documents and labels are yielded in one tuple.
- `max_docs`: Maximum number of documents to load. Defaults to None, which reads in all available eval documents.
- `preprocessor`: Optional PreProcessor to apply to each document.

**Returns**:

One tuple per iteration: (List of Documents, List of Labels)
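
For illustration, here is a minimal sketch of consuming the generator (the file path and batch size are assumptions):

```python
from haystack.preprocessor.utils import eval_data_from_jsonl

# Hypothetical jsonl file; documents and labels arrive in batches of up to 1000 documents
for docs, labels in eval_data_from_jsonl("data/squad20/dev-v2.0.jsonl", batch_size=1000):
    print(f"Batch: {len(docs)} documents, {len(labels)} labels")
```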

<a name="utils.convert_files_to_dicts"></a>
#### convert\_files\_to\_dicts

@@ -162,6 +185,22 @@ Fetch an archive (zip or tar.gz) from a url via http and extract content to an o

bool if anything got fetched
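
For illustration, a minimal sketch (the URL and output directory are assumptions):

```python
from haystack.preprocessor.utils import fetch_archive_from_http

# Hypothetical archive URL; the returned bool tells you whether anything was actually fetched
fetched = fetch_archive_from_http(url="https://example.com/eval_docs.tar.gz", output_dir="data/eval_docs")
```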

<a name="utils.squad_json_to_jsonl"></a>
#### squad\_json\_to\_jsonl

```python
squad_json_to_jsonl(squad_file: str, output_file: str)
```

Converts a SQuAD JSON file into jsonl format, with one document per line.

**Arguments**:

- `squad_file`: SQuAD file in JSON format.
- `output_file`: Name of the output file (SQuAD in jsonl format).
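
For illustration, a minimal sketch of the conversion (file names are assumptions):

```python
from haystack.preprocessor.utils import squad_json_to_jsonl

# Hypothetical paths: read a nested SQuAD json file, write one document per line
squad_json_to_jsonl(squad_file="data/squad20/dev-v2.0.json", output_file="data/squad20/dev-v2.0.jsonl")
```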

<a name="cleaning"></a>
# Module cleaning

20 changes: 15 additions & 5 deletions docs/_src/tutorials/tutorials/9.md
@@ -15,12 +15,22 @@ Haystack contains all the tools needed to train your own Dense Passage Retrieval
This tutorial will guide you through the steps required to create a retriever that is specifically tailored to your domain.


```python
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install torch==1.6.0+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html
```


```python
# Here are some imports that we'll need

from haystack.retriever.dense import DensePassageRetriever
from haystack.preprocessor.utils import fetch_archive_from_http
from haystack.document_store.memory import InMemoryDocumentStore
```

## Training Data
@@ -47,17 +57,17 @@ In some datasets, queries might have more than one positive context
in which case you can set the `num_positives` parameter to be higher than the default 1.
Note that `num_positives` needs to be less than or equal to the minimum number of `positive_ctxs` for queries in your data.
If you have an unequal number of positive contexts per example,
-you might want to generate some soft labels retrieving similar contexts which contain the answer.
+you might want to generate some soft labels by retrieving similar contexts which contain the answer.

DPR is typically trained using a method known as in-batch negatives.
-This means that positive contexts for given query are treated as negative contexts for the other queries in the batch.
+This means that positive contexts for a given query are treated as negative contexts for the other queries in the batch.
Doing so yields a high degree of computational efficiency and allows the model to be trained on large amounts of data.

`negative_ctxs` is not actually used in Haystack's DPR training, so we recommend you set it to an empty list.
It was used by the original DPR authors in an experiment comparing it against the in-batch negatives method.

`hard_negative_ctxs` are passages that are not relevant to the query.
-In the original DPR paper, these are fetched using a retriever to find the most similar passages to the positive passage.
+In the original DPR paper, these are fetched using a retriever to find the most relevant passages to the query.
Passages which contain the answer text are filtered out.
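
To make the expected fields concrete, here is a sketch of a single training example (the question, answer, and passage texts are invented; the keys follow the conventions described above):

```python
training_example = {
    "question": "Who wrote On the Origin of Species?",
    "answers": ["Charles Darwin"],
    # One or more passages that contain the answer; see num_positives above
    "positive_ctxs": [
        {"title": "Charles Darwin", "text": "Charles Darwin published On the Origin of Species in 1859 ..."}
    ],
    # Not used by Haystack's DPR training, so leave it empty
    "negative_ctxs": [],
    # Similar-looking passages that do NOT contain the answer
    "hard_negative_ctxs": [
        {"title": "Alfred Russel Wallace", "text": "Wallace independently conceived the theory of natural selection ..."}
    ],
}
```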

We are [currently working](https://github.com/deepset-ai/haystack/issues/705) on a script that will convert SQuAD format data into a DPR dataset!
@@ -153,7 +163,7 @@ for their max passage length but set max query length to 64 since queries are ve
## Initialize DPR model

retriever = DensePassageRetriever(
-    document_store=None,
+    document_store=InMemoryDocumentStore(),
    query_embedding_model=query_model,
    passage_embedding_model=passage_model,
    max_seq_len_query=64,
@@ -187,7 +197,7 @@ average_rank: 0.07075978511128166

retriever.train(
    data_dir=doc_dir,
-    train_filename=dev_filename,
+    train_filename=train_filename,
    dev_filename=dev_filename,
    test_filename=dev_filename,
    n_epochs=1,
26 changes: 26 additions & 0 deletions docs/v0.7.0/Makefile
@@ -0,0 +1,26 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.

SPHINXBUILD := sphinx-build
MAKEINFO := makeinfo

BUILDDIR := build
SOURCE := _src/
# SPHINXFLAGS := -a -W -n -A local=1 -d $(BUILDDIR)/doctree
SPHINXFLAGS := -A local=1 -d $(BUILDDIR)/doctree
SPHINXOPTS := $(SPHINXFLAGS) $(SOURCE)

# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
$(SPHINXBUILD) -M $@ $(SPHINXOPTS) $(BUILDDIR)/$@

20 changes: 20 additions & 0 deletions docs/v0.7.0/_src/api/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
29 changes: 29 additions & 0 deletions docs/v0.7.0/_src/api/_static/floating_sidebar.css
@@ -0,0 +1,29 @@
div.sphinxsidebarwrapper {
position: relative;
top: 0px;
padding: 0;
}

div.sphinxsidebar {
margin: 0;
padding: 0 15px 0 15px;
width: 210px;
float: left;
font-size: 1em;
text-align: left;
}

div.sphinxsidebar .logo {
font-size: 1.8em;
color: #0A507A;
font-weight: 300;
text-align: center;
}

div.sphinxsidebar .logo img {
vertical-align: middle;
}

div.sphinxsidebar .download a img {
vertical-align: middle;
}
46 changes: 46 additions & 0 deletions docs/v0.7.0/_src/api/_templates/xxlayout.html
@@ -0,0 +1,46 @@
{# put the sidebar before the body #}
{% block sidebar1 %}{{ sidebar() }}{% endblock %}
{% block sidebar2 %}{% endblock %}

{% block extrahead %}
<link href='https://fonts.googleapis.com/css?family=Open+Sans:300,400,700'
rel='stylesheet' type='text/css' />
{{ super() }}
{#- if not embedded #}
<style type="text/css">
table.right { float: left; margin-left: 20px; }
table.right td { border: 1px solid #ccc; }
{% if pagename == 'index' %}
.related { display: none; }
{% endif %}
</style>
<script>
// intelligent scrolling of the sidebar content
$(window).scroll(function() {
var sb = $('.sphinxsidebarwrapper');
var win = $(window);
var sbh = sb.height();
var offset = $('.sphinxsidebar').position()['top'];
var wintop = win.scrollTop();
var winbot = wintop + win.innerHeight();
var curtop = sb.position()['top'];
var curbot = curtop + sbh;
// does sidebar fit in window?
if (sbh < win.innerHeight()) {
// yes: easy case -- always keep at the top
sb.css('top', $u.min([$u.max([0, wintop - offset - 10]),
$(document).height() - sbh - 200]));
} else {
// no: only scroll if top/bottom edge of sidebar is at
// top/bottom edge of window
if (curtop > wintop && curbot > winbot) {
sb.css('top', $u.max([wintop - offset - 10, 0]));
} else if (curtop < wintop && curbot < winbot) {
sb.css('top', $u.min([winbot - sbh - offset - 20,
$(document).height() - sbh - 200]));
}
}
});
</script>
{#- endif #}
{% endblock %}
