
Added max_seq_length and batch_size params to EmbeddingRetriever #1817

Merged: 10 commits into deepset-ai:master, Nov 29, 2021
Conversation

@AhmedIdr (Contributor) commented Nov 26, 2021

Added the option to choose max_seq_length and batch_size when using EmbeddingRetriever.
Related to #1793
For RetriBERT, I am not totally sure whether I added the params in the right places.
Additionally, added a progress_bar to the FAISS write_documents method, since it is hard to keep track of the progress when working with large datasets.
Related to #1110

I don't usually make pull requests, so I hope I didn't mess anything up.
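
A minimal usage sketch of the new parameters (the import paths, store setup, and model name are assumptions for the Haystack version at the time, not part of this PR):

```python
# A sketch, assuming pre-1.0 Haystack import paths and an illustrative model;
# only the batch_size and max_seq_len keyword arguments come from this PR.
from haystack.document_store import FAISSDocumentStore
from haystack.retriever.dense import EmbeddingRetriever

document_store = FAISSDocumentStore()  # assumed setup, defaults to a local SQLite backend
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="deepset/sentence_bert",  # illustrative model choice
    model_format="farm",
    batch_size=32,    # new in this PR: how many texts are embedded per forward pass
    max_seq_len=512,  # new in this PR: maximum input length in tokens
)
```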

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

```diff
@@ -55,7 +55,7 @@ def __init__(
     retriever.embedding_model, revision=retriever.model_version, task_type="embeddings",
     extraction_strategy=retriever.pooling_strategy,
     extraction_layer=retriever.emb_extraction_layer, gpu=retriever.use_gpu,
-    batch_size=4, max_seq_len=512, num_processes=0,use_auth_token=retriever.use_auth_token
+    batch_size=retriever.batch_size, max_seq_len=retriever.max_seq_len,, num_processes=0,use_auth_token=retriever.use_auth_token
```
A Member commented on the diff:
There is a superfluous , here.
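
For reference, the fragment without the duplicated comma (a sketch of the fix only; everything else exactly as in the diff above):

```python
# Single comma after max_seq_len; the rest of the line is unchanged.
batch_size=retriever.batch_size, max_seq_len=retriever.max_seq_len, num_processes=0,use_auth_token=retriever.use_auth_token
```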

@julian-risch (Member) commented:
Hi @AhmedIdr, thanks for raising this pull request. I had a first brief look at your code and started our tests. That way, I identified a syntax error with a superfluous comma. Could you please fix that so that we can run the tests? Thank you! I will review the code more thoroughly once the tests are running. 👍

@AhmedIdr (Contributor, Author) commented Nov 26, 2021

Hey @julian-risch,
Sorry, I didn't notice the typos before creating the pull request. I tested the changes on another machine, so I didn't pay attention to some typos when copying and pasting. I hope everything is working now.

```diff
@@ -611,7 +611,7 @@ class EmbeddingRetriever(BaseRetriever)
 #### \_\_init\_\_

 ```python
-| __init__(document_store: BaseDocumentStore, embedding_model: str, model_version: Optional[str] = None, use_gpu: bool = True, model_format: str = "farm", pooling_strategy: str = "reduce_mean", emb_extraction_layer: int = -1, top_k: int = 10, progress_bar: bool = True, devices: Optional[List[Union[int, str, torch.device]]] = None, use_auth_token: Optional[Union[str,bool]] = None)
+| __init__(document_store: BaseDocumentStore, embedding_model: str, model_version: Optional[str] = None, use_gpu: bool = True, batch_size: int = 16, max_seq_len: int = 128, model_format: str = "farm", pooling_strategy: str = "reduce_mean", emb_extraction_layer: int = -1, top_k: int = 10, progress_bar: bool = True, devices: Optional[List[Union[int, str, torch.device]]] = None, use_auth_token: Optional[Union[str,bool]] = None)
```
@julian-risch (Member) commented on the diff Nov 26, 2021:
I can see that the previous default max_seq_len was 512. Is there any particular reason why you would prefer 128 as the default? If not I'd say we keep 512. (The change in the default value is probably also why some test cases fail right now but I could fix that if we decide to use 128 instead of 512.)

@AhmedIdr (Contributor, Author) replied:

No, there was no reason; we can set it back to 512 and the batch size back to 32.

AhmedIdr and others added 2 commits November 27, 2021 19:45
Changed default batch_size and max_seq_len in EmbeddingRetriever
@julian-risch (Member) commented:

Hi @AhmedIdr, the code looks good to me so far. Thanks for changing the default parameters. I just tried your changes in this colab notebook: https://colab.research.google.com/github/AhmedIdr/haystack/blob/add-params/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb, but I couldn't see the progress bar when running `write_documents()`. Have you tried that in a notebook?
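
A minimal way to reproduce that check in a notebook (a sketch reusing the document_store from the earlier example; the document field name is an assumption for the pre-1.0 Haystack API):

```python
# Write enough documents that the progress bar added in this PR is visible.
docs = [{"text": f"This is document number {i}."} for i in range(10_000)]
document_store.write_documents(docs)  # should now render a tqdm progress bar
```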

@AhmedIdr (Contributor, Author) commented Nov 29, 2021

Hey @julian-risch, no, I didn't; I only tried it using a script. But it is implemented similarly to the other methods, as suggested in #1110. I'll try to run it in a notebook and see if I can figure out what the problem is.

@AhmedIdr (Contributor, Author) commented Nov 29, 2021

Changing `from tqdm.auto import tqdm` back to `from tqdm import tqdm` seems to solve the problem.

@julian-risch (Member) commented Nov 29, 2021

My mistake, I didn't change the pip install statement in the notebook to your branch. It needs to be `!pip install git+https://github.com/AhmedIdr/haystack.git@add-params` 😅 or even `!pip install git+https://github.com/AhmedIdr/haystack.git@09f89bc` if you would like to check your previous commit from 2 days ago.

@AhmedIdr (Contributor, Author) commented:

Ah, I see, so using `tqdm.auto` also works. Do you prefer to change it back to `tqdm.auto` or keep `tqdm`?

@julian-risch (Member) commented:

> Ah, I see, so using `tqdm.auto` also works. Do you prefer to change it back to `tqdm.auto` or keep `tqdm`?

Yes, please change it back to `tqdm.auto`. Thank you.
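
For context, a minimal sketch of the difference being discussed (standard tqdm behavior, not code from this PR): `tqdm.auto` selects the notebook widget bar when running in Jupyter and falls back to the console bar elsewhere, while plain `tqdm` always renders the console bar.

```python
# The import pattern settled on here: notebook-friendly bar in Jupyter,
# console bar everywhere else.
from tqdm.auto import tqdm  # instead of: from tqdm import tqdm

for batch in tqdm(range(100), desc="Writing Documents"):
    pass  # e.g. write one batch of documents to the document store
```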

@AhmedIdr (Contributor, Author) commented:

Okay, done. I hope nothing went wrong when fetching the new commits from the main branch 😅

@julian-risch (Member) left a review:

Looks good to me and ready to merge. 👍
Thank you so much for your contribution to haystack! Having your pull request merged is a great achievement.

@julian-risch merged commit 56e4e84 into deepset-ai:master on Nov 29, 2021
@AhmedIdr (Contributor, Author) commented:

Thank you, and sorry it took a bit of time for everything to work properly 😅
