Docs v0.7.0 (deepset-ai#757)
* new docs version

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
PiffPaffM and github-actions[bot] committed Jan 22, 2021
1 parent 5081542 commit aee90c5
Showing 89 changed files with 7,739 additions and 106 deletions.
196 changes: 98 additions & 98 deletions docs/_src/api/api/document_store.md

Large diffs are not rendered by default.

45 changes: 42 additions & 3 deletions docs/_src/api/api/preprocessor.md
@@ -79,11 +79,11 @@ the parameters passed into PreProcessor.__init__(). Takes a single document as i
<a name="utils"></a>
# Module utils

<a name="utils.eval_data_from_file"></a>
#### eval\_data\_from\_file
<a name="utils.eval_data_from_json"></a>
#### eval\_data\_from\_json

```python
-eval_data_from_file(filename: str, max_docs: Union[int, bool] = None) -> Tuple[List[Document], List[Label]]
+eval_data_from_json(filename: str, max_docs: Union[int, bool] = None, preprocessor: PreProcessor = None) -> Tuple[List[Document], List[Label]]
```

Read Documents + Labels from a SQuAD-style file.
@@ -98,6 +98,29 @@ Document and Labels can then be indexed to the DocumentStore and be used for eva

(List of Documents, List of Labels)
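
For illustration, here is a minimal usage sketch (the file path and `max_docs` value are assumptions):

```python
from haystack.preprocessor.utils import eval_data_from_json

# Hypothetical path to a SQuAD-style eval file; max_docs keeps the test run small
docs, labels = eval_data_from_json("data/squad20/dev-v2.0.json", max_docs=10)
print(f"Loaded {len(docs)} documents and {len(labels)} labels")
```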

<a name="utils.eval_data_from_jsonl"></a>
#### eval\_data\_from\_jsonl

```python
eval_data_from_jsonl(filename: str, batch_size: Optional[int] = None, max_docs: Union[int, bool] = None, preprocessor: PreProcessor = None) -> Generator[Tuple[List[Document], List[Label]], None, None]
```

Read Documents + Labels from a SQuAD-style file in jsonl format, i.e. one document per line.
Documents and Labels can then be indexed to the DocumentStore and be used for evaluation.

This is a generator that yields one tuple per iteration, containing a list
of batch_size documents and a list of the documents' labels.
If batch_size is set to None, this method yields all documents and labels in a single tuple.

**Arguments**:

- `filename`: Path to file in SQuAD format
- `batch_size`: Number of documents to yield per iteration. If set to None, all documents and labels are yielded in one tuple.
- `max_docs`: Maximum number of documents to load. Defaults to None, which reads in all available eval documents.
- `preprocessor`: Optional PreProcessor to apply to each document.

**Returns**:

One tuple per iteration: (List of Documents, List of Labels)
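
For illustration, here is a minimal sketch of consuming the generator (the file path and batch size are assumptions):

```python
from haystack.preprocessor.utils import eval_data_from_jsonl

# Hypothetical jsonl file; documents and labels arrive in batches of up to 1000 documents
for docs, labels in eval_data_from_jsonl("data/squad20/dev-v2.0.jsonl", batch_size=1000):
    print(f"Batch: {len(docs)} documents, {len(labels)} labels")
```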

<a name="utils.convert_files_to_dicts"></a>
#### convert\_files\_to\_dicts

@@ -162,6 +185,22 @@ Fetch an archive (zip or tar.gz) from a url via http and extract content to an o

bool if anything got fetched
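
For illustration, a minimal sketch (the URL and output directory are assumptions):

```python
from haystack.preprocessor.utils import fetch_archive_from_http

# Hypothetical archive URL; the returned bool tells you whether anything was actually fetched
fetched = fetch_archive_from_http(url="https://example.com/eval_docs.tar.gz", output_dir="data/eval_docs")
```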

<a name="utils.squad_json_to_jsonl"></a>
#### squad\_json\_to\_jsonl

```python
squad_json_to_jsonl(squad_file: str, output_file: str)
```

Converts a SQuAD JSON file into jsonl format, with one document per line.

**Arguments**:

- `squad_file`: SQuAD file in JSON format.
- `output_file`: Name of the output file (SQuAD in jsonl format).
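
For illustration, a minimal sketch of the conversion (file names are assumptions):

```python
from haystack.preprocessor.utils import squad_json_to_jsonl

# Hypothetical paths: read a nested SQuAD json file, write one document per line
squad_json_to_jsonl(squad_file="data/squad20/dev-v2.0.json", output_file="data/squad20/dev-v2.0.jsonl")
```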

<a name="cleaning"></a>
# Module cleaning

20 changes: 15 additions & 5 deletions docs/_src/tutorials/tutorials/9.md
@@ -15,12 +15,22 @@ Haystack contains all the tools needed to train your own Dense Passage Retrieval
This tutorial will guide you through the steps required to create a retriever that is specifically tailored to your domain.


```python
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install torch==1.6.0+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html
```


```python
# Here are some imports that we'll need

from haystack.retriever.dense import DensePassageRetriever
from haystack.preprocessor.utils import fetch_archive_from_http
from haystack.document_store.memory import InMemoryDocumentStore
```

## Training Data
@@ -47,17 +57,17 @@ In some datasets, queries might have more than one positive context
in which case you can set the `num_positives` parameter to be higher than the default 1.
Note that `num_positives` needs to be less than or equal to the minimum number of `positive_ctxs` for queries in your data.
If you have an unequal number of positive contexts per example,
-you might want to generate some soft labels retrieving similar contexts which contain the answer.
+you might want to generate some soft labels by retrieving similar contexts which contain the answer.

DPR is typically trained using a method known as in-batch negatives.
-This means that positive contexts for given query are treated as negative contexts for the other queries in the batch.
+This means that positive contexts for a given query are treated as negative contexts for the other queries in the batch.
Doing so yields a high degree of computational efficiency and allows the model to be trained on large amounts of data.

`negative_ctxs` is not actually used in Haystack's DPR training, so we recommend you set it to an empty list.
It was used by the original DPR authors in an experiment comparing it against the in-batch negatives method.

`hard_negative_ctxs` are passages that are not relevant to the query.
-In the original DPR paper, these are fetched using a retriever to find the most similar passages to the positive passage.
+In the original DPR paper, these are fetched using a retriever to find the most relevant passages to the query.
Passages which contain the answer text are filtered out.
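
To make the expected fields concrete, here is a sketch of a single training example (the question, answer, and passage texts are invented; the keys follow the conventions described above):

```python
training_example = {
    "question": "Who wrote On the Origin of Species?",
    "answers": ["Charles Darwin"],
    # One or more passages that contain the answer; see num_positives above
    "positive_ctxs": [
        {"title": "Charles Darwin", "text": "Charles Darwin published On the Origin of Species in 1859 ..."}
    ],
    # Not used by Haystack's DPR training, so leave it empty
    "negative_ctxs": [],
    # Similar-looking passages that do NOT contain the answer
    "hard_negative_ctxs": [
        {"title": "Alfred Russel Wallace", "text": "Wallace independently conceived the theory of natural selection ..."}
    ],
}
```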

We are [currently working](https://github.com/deepset-ai/haystack/issues/705) on a script that will convert SQuAD format data into a DPR dataset!
@@ -153,7 +163,7 @@ for their max passage length but set max query length to 64 since queries are ve
## Initialize DPR model

retriever = DensePassageRetriever(
-    document_store=None,
+    document_store=InMemoryDocumentStore(),
    query_embedding_model=query_model,
    passage_embedding_model=passage_model,
    max_seq_len_query=64,
@@ -187,7 +197,7 @@ average_rank: 0.07075978511128166

retriever.train(
    data_dir=doc_dir,
-    train_filename=dev_filename,
+    train_filename=train_filename,
    dev_filename=dev_filename,
    test_filename=dev_filename,
    n_epochs=1,
26 changes: 26 additions & 0 deletions docs/v0.7.0/Makefile
@@ -0,0 +1,26 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.

SPHINXBUILD := sphinx-build
MAKEINFO := makeinfo

BUILDDIR := build
SOURCE := _src/
# SPHINXFLAGS := -a -W -n -A local=1 -d $(BUILDDIR)/doctree
SPHINXFLAGS := -A local=1 -d $(BUILDDIR)/doctree
SPHINXOPTS := $(SPHINXFLAGS) $(SOURCE)

# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
$(SPHINXBUILD) -M $@ $(SPHINXOPTS) $(BUILDDIR)/$@

20 changes: 20 additions & 0 deletions docs/v0.7.0/_src/api/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
29 changes: 29 additions & 0 deletions docs/v0.7.0/_src/api/_static/floating_sidebar.css
@@ -0,0 +1,29 @@
div.sphinxsidebarwrapper {
position: relative;
top: 0px;
padding: 0;
}

div.sphinxsidebar {
margin: 0;
padding: 0 15px 0 15px;
width: 210px;
float: left;
font-size: 1em;
text-align: left;
}

div.sphinxsidebar .logo {
font-size: 1.8em;
color: #0A507A;
font-weight: 300;
text-align: center;
}

div.sphinxsidebar .logo img {
vertical-align: middle;
}

div.sphinxsidebar .download a img {
vertical-align: middle;
}
46 changes: 46 additions & 0 deletions docs/v0.7.0/_src/api/_templates/xxlayout.html
@@ -0,0 +1,46 @@
{# put the sidebar before the body #}
{% block sidebar1 %}{{ sidebar() }}{% endblock %}
{% block sidebar2 %}{% endblock %}

{% block extrahead %}
<link href='https://fonts.googleapis.com/css?family=Open+Sans:300,400,700'
rel='stylesheet' type='text/css' />
{{ super() }}
{#- if not embedded #}
<style type="text/css">
table.right { float: left; margin-left: 20px; }
table.right td { border: 1px solid #ccc; }
{% if pagename == 'index' %}
.related { display: none; }
{% endif %}
</style>
<script>
// intelligent scrolling of the sidebar content
$(window).scroll(function() {
var sb = $('.sphinxsidebarwrapper');
var win = $(window);
var sbh = sb.height();
var offset = $('.sphinxsidebar').position()['top'];
var wintop = win.scrollTop();
var winbot = wintop + win.innerHeight();
var curtop = sb.position()['top'];
var curbot = curtop + sbh;
// does sidebar fit in window?
if (sbh < win.innerHeight()) {
// yes: easy case -- always keep at the top
sb.css('top', $u.min([$u.max([0, wintop - offset - 10]),
$(document).height() - sbh - 200]));
} else {
// no: only scroll if top/bottom edge of sidebar is at
// top/bottom edge of window
if (curtop > wintop && curbot > winbot) {
sb.css('top', $u.max([wintop - offset - 10, 0]));
} else if (curtop < wintop && curbot < winbot) {
sb.css('top', $u.min([winbot - sbh - offset - 20,
$(document).height() - sbh - 200]));
}
}
});
</script>
{#- endif #}
{% endblock %}
